ABSTRACT

This primer is an introduction to item response theory (also called
item characteristic curve theory, or latent trait theory) as it is
used most commonly: for scoring multiple choice achievement or
aptitude tests. Written for the testing practitioner with minimum
training in statistics and psychometrics, it presents and illustrates
the basic mathematical concepts needed to understand the theory.
Then, building upon those concepts, it develops the basic concepts of
item response theory: item parameters, item response function, test
characteristic curve, item information functions, test information
curve, relative efficiency curve, and score information curve. The
maximum likelihood and Bayesian modal estimates of ability are
described with illustrative examples. After a discussion of
assumptions and available computer programs, some practical
applications are presented, i.e. equating scales, tailored testing,
item cultural bias, and setting pass-fail cut-offs. (Author/CP)
BOOKMARK AND GLOSSARY
of special terms and symbols

A         = # of alternatives in a multiple choice question
a-value   = discrimination index
ASI       = Alternative Similarity Index
b-value   = difficulty index
BME       = Bayesian Modal Estimation
c-value   = pseudo-guessing index
CRT       = Cathode Ray Tube device
d-value   = point biserial correlation
d.f.      = distribution function, an ogive
E         = Error score
e         = base of natural logarithm
exp()     = e raised to the power of whatever is in the parenthesis
            after the exp
f.f.      = frequency function, bell shaped curve
I(θ)      = Test Information Curve
I(θ, u)   = Item Information Function
ICC       = Item Characteristic Curve, same as IRF
IIF       = Item Information Function, I(θ, u)
IRF       = Item Response Function
IRT       = Item Response Theory
KR-20     = Kuder-Richardson Formula 20
L         = Likelihood
L(0,1.7)  = Logistic Frequency Function
L(θ|U)    = Likelihood of θ, given U
L(U|θ)    = Likelihood of U, given θ
m         = slope of the ogive at the b-value
MAPL      = Minimum Acceptable Performance Level
MLE       = Maximum Likelihood Estimation
N(0,1)    = Normal f.f.
p-value   = proportion of examinees selecting an item alternative
P_i       = P_i(θ) = Probability of getting item correct, given θ
Q_i       = Q_i(θ) = Probability of getting item wrong, given θ
r_gθ      = item biserial correlation
r_gh      = interitem tetrachoric correlation
r         = reliability of classical test theory
REC       = Relative Efficiency Curve, ratio of TIC's
SBAYES    = Simplified Bayesian, same as BME
SD        = standard deviation
SEE       = Standard Error of Estimate
SIC       = Score Information Curve
SME       = Subject Matter Expert
T         = True score, Observed score - Error
TIC       = Test Information Curve, I(θ), Σ I(θ, u)
USCSC     = U.S. Civil Service Commission
U         = response vector, response pattern
u         = response, u_i = 1 if response is correct & u_i = 0 if
            response is wrong
W(θ)      = optimal weight of an item
X         = Observed score
X̄         = Mean
θ         = Theta, the ability scale
∫         = Integral sign
Ψ         = Psi, logistic ogive
Φ         = Phi, normal ogive
Σ         = Summation of a series of numbers
Π         = Product of a series of numbers



PREFACE 



One year ago I had never heard of latent trait theory, an item
characteristic curve, or Fred Lord. On my first reading of Lord and
Novick (1968) Chapters 16 and 17, I understood absolutely nothing.
After several hours of study on my second reading, I finally comprehended
one simple equation. During the next several months I reread parts
of Lord and Novick as many as 20 times, I taught myself some
differential calculus, integral calculus, mathematical statistics,
probability theory and linear algebra, I attended Fred Lord's course in
Item Response Theory at the Educational Testing Service, Princeton, NJ,
and I read several publications on Item Response Theory.

I have now gotten to the point where I am able to use Item
Response Theory for my purposes, although there is still much that I
do not understand.

Upon reflection, I find that, as is true in many sciences, it is
not necessary to fully understand the theoretical background and
mathematical development in order to apply the results of the model.

It is widely acknowledged in the field that one of the main
reasons that item response theory has been so slow to catch on among
testing practitioners is the mathematical complexity of the literature.
Most of the literature is written with language and notation that is
standard for the researchers. However, that language and notation
is confusing to the thousands of testing practitioners whose technical
training amounts to a couple of courses in statistics and tests and
measurement, if that much. On the other hand, many of the concepts
used in the literature are not difficult to understand, if explained
in less esoteric language and with a few examples.

Therefore, it became my resolve that no testing practitioner, such
as I, should have to go through what I went through in order to
gain a basic understanding of item response theory. The purpose of
this paper is to fulfill that resolve.

Since very little of this paper is original with me, by
rights there should be a reference for nearly every sentence or
paragraph. Such complete references, however, will not be included
because they would be out of place for a primer, and usually not of
interest to the novice. My primary references are Lord & Novick (1968)
and Lord (in preparation). Some references will be included to direct
the reader to more thorough and detailed explanations. Other
references will be included where authoritative support is deemed
desirable.

A primer is necessarily incomplete. It is also inaccurate when
it contains oversimplifications which apply to the general case, but
do not apply to extreme, unusual, or uninteresting cases. This paper
will be guilty of such generalities and rules of thumb.

Other excellent, less elementary introductory material is also
available. (See Baker, 1977; Hambleton & Cook, 1977; Sympson, 1977).

I am indebted to ENS De^a Cook, ENS Pamela Crandall, ENS Charles
Pastine, and LtJG Larry Young for their assistance in the analysis of
data.

My appreciation for the many suggestions and corrections made by
the several readers and reviewers is gratefully acknowledged. They
are: John A. Burt, Joseph Cowan, Myron A. Fischl, Steven Gorman, Karen
Jones, Frederic M. Lord, James R. McBride, Alan Nicewander,
Malcolm J. Ree, and James B. Sympson.

I would also like to thank YN2 Ron Smith for his excellent art
work, and Jim Walls for his systems analysis and computer
programming.

THOMAS A. WARM

January 22, 1978
CHAPTER 1
INTRODUCTION



1.1 Item Response Theory (IRT) is the most significant development
in psychometrics in many years! It is, perhaps, to psychometrics
what Einstein's relativity theory is to physics. I do not doubt that
during the next decade it will sweep the field of psychometrics. It
has been said that IRT allows one to answer any question about an
item (test question), a test, or an examinee, that one is entitled to
ask. Although this statement is somewhat circular, it will give you
an idea of the terrific power of IRT and of the mathematical
estimation methods involved.

The most common application of IRT is with multiple-choice
questions in an ability test. That use will be the thrust of this
paper, although IRT applies as well to free response (fill in)
items. I make no distinction between ability and knowledge testing.
IRT applies to tests for both. Thus, the term "ability" will be used
for both types of tests. No application to personality or
interest testing will be discussed.

1.2 If we give several tests in the same subject matter area to a
group of examinees, we find that in general the same examinees score
high on the tests and the same examinees score low. In other words,
we find consistency in the performance of examinees on the different
tests.

To explain this consistency we assume that there is something
inside the examinees that causes them to score consistently. We call
that something a mental trait.

In the vernacular the word "trait" implies an innate, inherited
characteristic. We don't necessarily mean that. We mean only that
characteristic of the examinee that causes consistent performance on
the tests, whatever, if anything, it is.

No one has found a physical referent for a mental trait, and few
really expect to. It is sometimes tempting to think of a trait as
having a physical referent like a brain engram, but that is always
unnecessary. In this sense, a trait is an intervening variable, as
opposed to a hypothetical construct. Since the mental trait has no
known physical referent, it is never observed directly. Therefore,
it is called a "latent" trait.

1.3 The scale of the latent trait is traditionally given the name of
the Greek letter theta (θ). I will use the terms theta, ability level,
amount of trait, and amount of subject-matter-knowledge interchangeably.
Theta is a continuum from minus infinity (-∞) to plus infinity (+∞).
It has no natural zero point or unit. Therefore, the zero-point and
unit are often taken as the mean and standard deviation, respectively,
of some reference sample of examinees. Thus, values of θ usually vary
from -3 to +3, but may be observed outside that range. The θs of a
sample need not be distributed normally.
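
As a small illustration of fixing the zero-point and unit from a reference sample, the following Python sketch rescales some provisional ability values so that the sample mean becomes 0 and the sample standard deviation becomes 1. The data and the function name `to_theta` are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, pstdev

# Hypothetical provisional ability values for a reference sample (assumed data).
reference_sample = [12.0, 15.0, 9.0, 18.0, 11.0, 13.0]

def to_theta(value, sample):
    """Re-express a value with the sample mean as zero and the sample SD as the unit."""
    return (value - mean(sample)) / pstdev(sample)

thetas = [to_theta(v, reference_sample) for v in reference_sample]
for original, theta in zip(reference_sample, thetas):
    print(f"{original:5.1f} -> theta = {theta:6.3f}")
```

After rescaling, the reference sample itself has mean 0 and standard deviation 1, which is why most values fall between -3 and +3.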

1.4 When an examinee walks into a testing room, he brings with him his
theta.* The purpose of the test, then, is to measure the relative
position of the examinees on the theta scale. The test interprets the
examinee's theta and produces a measurement of ability, which is often
the raw (number right) score. The test is the measuring instrument.
Often measurement of an ability with a test is made analogous to
measurement of height with a tape rule. But there is an important
difference. Height, whether measured by an English rule or metric rule,
is always on an equal interval scale. Histograms of a group of people
will always look the same except for some linear stretching of a
scale.

*The generic masculine pronouns will be used for convenience.

That is not the case with testing. The histograms of raw scores
of the same people on two tests will seldom look the same, even with
linear stretching of a scale. That is because each test has its own
peculiar scale (also called metric). The peculiarity of a test's
metric distorts the distribution of examinees. Until IRT there has
been no way to identify the peculiar scale of a test.




CHAPTER 2 

Classical Test Theory vs. Item Response Theory 

2.1 Classical test theory has been developed over a period of many
years. Gulliksen (1950) is an excellent presentation of classical test
theory.

Most testing practitioners use classical test theory, whether they
know it or not. The basic tools of most testing practitioners are:

a. p-value = proportion of examinees selecting an item alternative
(also called "item difficulty"),

b. d-value = point-biserial correlation between the item alternative
and the test (some use the biserial correlation) (also called
"item discrimination"),

c. mean of examinees' (number right) scores,

d. standard deviation of examinees' scores,

e. skewness and kurtosis of examinees' scores,

f. reliability of the test, usually KR-20, the Kuder-Richardson
Formula 20 (a special case of Cronbach's coefficient alpha).
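
For readers who have not computed these statistics by hand, the following Python sketch works three of them (p-value, d-value, and KR-20) on a tiny right/wrong response table. The response data are hypothetical, and the functions are only one plain way of writing the standard formulas; the d-value here correlates each item with the total score that includes the item.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical 0/1 responses: rows are examinees, columns are items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

def p_values(rows):
    """Proportion of examinees answering each item correctly."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]

def d_value(rows, j):
    """Point-biserial correlation between item j and the number-right score."""
    item = [row[j] for row in rows]
    totals = [sum(row) for row in rows]
    mx, my = mean(item), mean(totals)
    cov = mean([(x - mx) * (y - my) for x, y in zip(item, totals)])
    return cov / (pstdev(item) * pstdev(totals))

def kr20(rows):
    """Kuder-Richardson Formula 20."""
    k = len(rows[0])
    totals = [sum(row) for row in rows]
    sum_pq = sum(p * (1 - p) for p in p_values(rows))
    return k / (k - 1) * (1 - sum_pq / pvariance(totals))

print([round(p, 3) for p in p_values(responses)])
print([round(d_value(responses, j), 3) for j in range(4)])
print(round(kr20(responses), 3))
```

Dropping or adding examinees in the table changes every one of these numbers, which is the dependence on the particular group that the following paragraphs discuss.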



Anyone whose test analysis is principally based on the statistics
listed above is using classical test theory. The problem with those
statistics is that they are relative to the characteristics of the test
and of the examinees.

The p-value is relative to the ability level of the examinees.
The same item given to a high ability group and a low ability group will
get two different p-values for the two groups. It can be shown that
p-values are not true measures of relative item difficulty. It is not
uncommon for items measuring the same ability to reverse the order of
their p-values when given to groups of different average ability. For
example, item A may have a higher p-value than item B for one group of
examinees, but have a lower p-value than item B for a different group.
This effect is not a matter of sampling error.
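
The reversal can be reproduced with a short Python sketch. It borrows the three-parameter logistic function that is developed later in Chapter 5 and computes expected p-values, so no sampling error is involved. The two items and the two ability groups are hypothetical values chosen only to make the effect visible.

```python
import math

def prob_correct(theta, a, b, c):
    """Three-parameter logistic ogive (developed in Chapter 5)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def p_value(group, item):
    """Expected proportion of a group answering the item correctly."""
    return sum(prob_correct(theta, *item) for theta in group) / len(group)

# Hypothetical items as (a, b, c) and hypothetical ability groups.
item_a = (2.0, 0.0, 0.0)   # steep ogive
item_b = (0.5, 0.0, 0.0)   # flat ogive
low_group = [-1.5, -1.0, -0.5]
high_group = [0.5, 1.0, 1.5]

for name, group in (("low", low_group), ("high", high_group)):
    print(name, round(p_value(group, item_a), 3), round(p_value(group, item_b), 3))
```

In the low group item A has the lower p-value (it looks harder), while in the high group item A has the higher p-value (it looks easier), even though neither item has changed.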

The d-value is relative to the homogeneity of the ability levels
of the examinees in the sample, the subject-matter homogeneity of the
items in the test, and the dispersion of p-values of items in the test.
The same item, given to a group of examinees who are similar in ability
and to another group with a wide range of ability, will produce two
different d-values for the two groups. Similarly, an item included in
a test with other items that are homogeneous in content and p-value
will get a d-value different from the d-value it will receive in a
heterogeneous test.

The mean, standard deviation, skewness and kurtosis will also vary
according to the characteristics of the test and examinees.

The reliability is relative to the standard deviation of the test,
and to the p-values and d-values of the items in the test, all of which
are dependent upon the particular abilities of the examinees and the
characteristics of the test.

The following quote gives another liability of using classical
test theory in culture-fair testing studies:

"It can be shown that classical parameters (e.g. p-value) will
generally not be linearly related across subgroups of a population.
This means that the test for cultural bias using classical parameters
can lead to an artifactual detection of bias." (Pine, 1977, p. 40)
Clearly, classical test theory statistics are meaningful only in
an extremely limited situation, i.e., when the same item is given to
the same population as part of strictly parallel tests. Such a
situation rarely occurs. Furthermore, the basic precepts and definitions
of classical test theory are untestable, i.e. they are tautologies.
They are simply taken as true without any way to empirically determine
their relevance to reality. Some are assumed to be true even when this
does not appear to be warranted. Thus, no one knows if the classical
test model applies to any real test.

2.2 In contrast IRT makes possible item and test statistics which are
dependent neither on the characteristics of the examinees nor on the
other items in the test. They are invariant. With the item statistics
it becomes possible to describe in precise terms the characteristics of
the test before the test is administered. This capability allows one to
construct a test that is highly efficient in accomplishing the purpose
of the test. It also provides an extremely powerful tool for special
studies, such as item cultural bias.

Moreover, the assumptions of IRT are explicit and have the
potential of empirical testing. It is possible to discover if the data
reasonably meet the assumptions.



CHAPTER 3 

A Brief History of Item Response Theory 

3.1 The origin of latent trait theory can be traced to Ferguson (1942)
and Lawley (1943). Item Response Theory is just one of several models
under latent trait theory. The Rasch model is another.

3.2 Other early publications using some of the same concepts are
Brogden (1946), Tucker (1946), Carroll (1950), and Cronbach and
Warrington (1952).

3.3 In 1952, Lord published his Ph.D. dissertation in which he
presented IRT as a model or theory in its own right. At that time he
called it Item Characteristic Curve Theory. Thus, Lord is considered
the father and founder of IRT. Shortly after publishing his
dissertation, Lord stopped work on IRT for ten years, due to a seemingly
intractable problem with it.*

3.4 In 1960, Rasch (1960) published his one-parameter sample-free
model. The Rasch model stirred much interest and considerable work was
done on it during the next decade. Its leading proponent in the U.S.
is Benjamin Wright, a psychoanalyst at the University of Chicago. (See
Wright, 1977 for references.)

3.5 In 1965, Lord (1965) conducted a massive study, using a sample
size of greater than 100,000. That study showed that the "problem",
which had deterred his work for so long, was not really a problem, and
that IRT was appropriate for real life multiple-choice tests. With
that study Lord began work again on IRT.

*This problem is discussed in Section 14.2.



3.6 In 1968, Lord and Novick published a psychometrics textbook,
within which were four chapters (17-20) by Allan Birnbaum (1968), a
well-known statistician (now deceased). Birnbaum's chapters worked out
in detail the mathematics of the two and three parameter normal ogive
and logistic models.*

3.7 Soon thereafter Urry (1970) completed his Ph.D. dissertation in
which he compared the one, two, and three parameter models. He
concluded that the three parameter model best described the real world
for multiple-choice tests.

3.8 Since Urry's dissertation, much work has been done on all three
models (i.e., one, two, and three parameter), but the three parameter
model is now receiving most of the attention because it best describes
reality. Accordingly, I shall deal with the 3-parameter model only.

3.9 Much of the work on the 3-parameter model is coming from 3
principal sources. The sources are:

a. Frederic M. Lord, Distinguished Research Scientist,
Educational Testing Service, Princeton, NJ.

b. Vern W. Urry, Personnel Research Psychologist, United States
Civil Service Commission, Washington, D.C.

c. David J. Weiss, Prof. of Psychology, Psychometric Methods
Program, University of Minnesota, Minneapolis, MN.

There are, of course, many other highly productive researchers
publishing excellent studies. Failure to include them in this list is
more an indication of my limited exposure than of the significance of
their contributions.

*The normal ogive and logistic ogive will be compared briefly in
Chapter 4.

3.10 The United States Civil Service Commission has adopted a
particular application of IRT as official policy. The five U.S. armed
forces (including the U.S. Coast Guard) are also investigating the
application of IRT.

3.11 In 1977 Lord changed the name of his model from Item
Characteristic Curve Theory to Item Response Theory.
CHAPTER 4

The Normal Ogive and Logistic Ogive

4.1 I trust the reader will recognize the normal curve plotted in
Figure 4.1 with the pluses (++++). It has a mean = 0, and standard
deviation = 1. The formula for this normal curve is identified in
Figure 4.1 as N(0,1).

4.2 A bell-shaped curve like this is called a frequency function
(f.f.). It is called a frequency function even when the ordinate
(vertical axis) is defined as frequency, proportion, percent, or
density (Kendall and Stuart, 1977, p. 13). Therefore, we call the
normal curve the "normal frequency function."

4.3 Superimposed over the normal f.f. in Figure 4.1 is a logistic*
curve or logistic frequency function, plotted with dots (....).
This logistic f.f. also has a mean = 0 and a standard deviation of
approximately 1.0. The formula for this logistic f.f. is identified in
Figure 4.1 as L(0,1.7). The 1.7 in the exponent of the formula is
chosen to allow the logistic f.f. to approximate the normal f.f. as
closely as possible. The actual value is 1.6679, which is rounded to
1.7. In some of the literature the 1.7 is represented by the upper
case letter D. The letter e is the base of natural logarithms;
e = 2.718281828.

4.4 The reader will also recognize the S-shaped curve in Figure 4.4
as the normal cumulative frequency curve. An S-shaped curve is
called an ogive.** This curve gives the proportion of area under the
normal curve (Figure 4.1) that lies to the left of each point on the
abscissa (horizontal axis).

*pronounced lojistic
**pronounced ojive
Figure 4.4. Distribution function (d.f.) for the N(0,1) and L(0,1.7)
frequency functions.

4.5 An ogive like this is called a distribution function (d.f.). It
is called a distribution function even when the ordinate is defined as
cumulative frequency, cumulative proportion, cumulative percent, or
cumulative area (Kendall & Stuart, 1977, p. 13). Therefore, we call the
curve in Figure 4.4 a "normal distribution function," or a "normal
ogive". The formula for this normal d.f. is identified in Figure 4.4
as ∫N(0,1).

4.6 Also in Figure 4.4, but not discernible, is the logistic ogive
(or logistic d.f.) for the logistic f.f. in Figure 4.1. It is not
discernible, because it is so close to the normal ogive that on this
scale the two curves merge together in the width of the ink line. A
small portion has been magnified to a larger scale (10x), so that the
difference may be seen. The magnified area was chosen at the place
where the 2 ogives are farthest apart. The reader can verify that at
any point on the abscissa the 2 ogives are always less than .01 apart
on the ordinate, as is indicated by the inequality under the
magnification in Figure 4.4. The formula for this logistic d.f. is
identified in Figure 4.4 as ∫L(0,1.7).
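
A reader without the figure can check the ".01 apart" statement numerically. The Python sketch below evaluates both ogives on a fine grid and reports the largest vertical gap. The normal ogive is computed from the error function in Python's standard library; the grid limits and spacing are arbitrary choices, so this is a numerical check and not a proof.

```python
import math

def normal_ogive(z):
    """Normal d.f.: area under N(0,1) to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_ogive(z):
    """Logistic d.f. with the scaling constant 1.7."""
    return 1.0 / (1.0 + math.exp(-1.7 * z))

# Scan z from -6 to +6 in steps of .001 (arbitrary grid).
grid = [i / 1000.0 for i in range(-6000, 6001)]
largest_gap = max(abs(normal_ogive(z) - logistic_ogive(z)) for z in grid)
print(largest_gap, largest_gap < 0.01)
```

Note that the normal ogive needs a special library function, while the logistic ogive is a one-line algebraic expression. That difference in convenience is the subject of the next sections.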

4.7 The ogive with which we are concerned is the normal ogive.
However, note the integral sign (∫) on the right side of the
definition for the normal ogive. The integral sign there means that no
algebraic function can be found to describe the normal ogive. This fact
makes the normal ogive very cumbersome to work with mathematically. It
requires numerical methods to solve, or a table of values.

4.8 On the other hand the logistic ogive has no integral sign on the
right side of its definition. In fact, the expression on the right in
Figure 4.4 is the algebraic function describing the logistic ogive.
The logistic ogive is very easy to work with.*

4.9 For these reasons the logistic ogive is substituted as a
convenient and very close approximation to the normal ogive.

4.10 This paper will only deal with the logistic ogive. Statements
about the logistic ogive may be taken as close approximations to the
normal ogive model. The logistic f.f. is no longer of interest to us.

*Some interesting logistic identities are given in Appendix A.
CHAPTER 5
More About Logistic Ogives



5.1 Figure 4.4 shows just one logistic ogive. There is actually an
infinite family of logistic (and normal) ogives, each different in
some way from every other one.

5.2 Logistic ogives are strictly monotonic functions. They are
strictly monotonic because, going from left to right, the ogive
always gets higher and higher, never is completely horizontal, and
never goes down.

5.3 Notice the ogive in Figure 4.4. Between -2.0 and -0.5 on the
horizontal axis the ogive is concave upward. Between 0.5 and 2.0 it
is concave downward. At some point between -0.5 and 0.5 this ogive
must change from being concave upward to concave downward. That
point is called the "inflection point." The inflection point is
always the point where the slope of the ogive is at its maximum. The
inflection point for this ogive is located on the vertical axis at
.50, and on the horizontal axis at 0.0.
5.4 Three-parameter logistic ogives (with which we are exclusively
concerned) may differ from each other in only 3 ways, one for each
parameter.

5.5 One way in which logistic ogives may differ is in the horizontal
location of the inflection point. Figure 5.5 shows 3 logistic ogives
labeled E, F, and G with their inflection points at different places on
the abscissa. You can see that the 3 ogives are exactly the same
except for a sideways shift of the entire curve. Shifting the
inflection point sideways shifts the entire ogive sideways. The
horizontal position of the inflection point is called the
"b-parameter". Some call it, as we will, the "b-value". The b-values
of ogives E, F, and G in Figure 5.5 are -.5, 0.0 and 1.0, respectively.

5.6 To include the b-parameter in the logistic ogive function, it is
only necessary to subtract the b-parameter from the horizontal axis
variable.

5.7 Figures 4.1, 4.4, and 5.5 were constructed with the horizontal
axis labeled z. This label was chosen to facilitate understanding of
the logistic f.f. and d.f., because of the reader's likely familiarity
with the traditional z-scores of measurement. Since we are concerned
with the ability scale called θ, we now and hereafter label the
horizontal axis θ. Substituting θ for z in the logistic function,
and subtracting the b-parameter, gives the height of the logistic
ogive by the function

$$\Psi(\theta) = \left[1 + e^{-1.7(\theta - b)}\right]^{-1}$$

which is sometimes written

$$\Psi(\theta) = \frac{1}{1 + \exp(-1.7(\theta - b))}$$

where exp means e raised to the power of whatever is in the
parenthesis after the exp. The upper case Greek letter psi (Ψ) is used
in the literature to mean the logistic ogive. Phi (Φ) is used to
mean the normal ogive.

[Figure 5.11. Three logistic ogives H, J, and K, labeled on the figure
as H: y = .30 + (1 - .30)Ψ[1.7(θ - 0.0)], J: y = .25 + (1 - .25)Ψ[1.7(θ - 0.0)],
and K: y = .15 + (1 - .15)Ψ[1.7(θ - 0.0)].]
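
The sideways shift produced by the b-parameter can be sketched in a few lines of Python. The function below is the logistic ogive of section 5.7, and the b-values are those given for ogives E, F, and G in section 5.5. The evaluation point θ = 1.5 is an arbitrary choice.

```python
import math

def logistic_ogive(theta, b=0.0):
    """Height of the logistic ogive at theta, with the inflection point at b."""
    return 1.0 / (1.0 + math.exp(-1.7 * (theta - b)))

# b-values of ogives E, F, and G as given in section 5.5.
b_values = {"E": -0.5, "F": 0.0, "G": 1.0}

for label, b in b_values.items():
    # Each ogive has height .50 at its own b-value, and is otherwise
    # the b = 0 ogive read at the shifted position theta - b.
    print(label, logistic_ogive(b, b), logistic_ogive(1.5, b), logistic_ogive(1.5 - b))
```

For each ogive the second and third printed heights agree, which is the sense in which the curves are "exactly the same except for a sideways shift."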



5.8 The logistic ogive has 2 asymptotes. The asymptotes are horizontal
lines that the ogive approaches at its extremes, but never quite
reaches. The upper asymptote is located on the vertical axis at
1.00. In Figures 4.4 and 5.5 you can see that the upper, right part
of the logistic ogives approaches the value of 1.00 on the vertical
axis. In the figures it may appear as though they touch the
horizontal line at 1.00, but, strictly speaking, they never quite do.

5.9 The lower asymptote for the ogives in Figures 4.4 and 5.5 is
the horizontal axis with a height of zero. Just as the upper part of
the ogive never quite reaches 1.00, the lower part of the ogive never
quite reaches the lower asymptote.

5.10 All logistic ogives in IRT have an upper asymptote at 1.00, but
not all have a lower asymptote at .00. In fact, few do.

5.11 Figure 5.11 shows 3 logistic ogives, labeled H, J, and K, which
are identical except for different lower asymptotes. The lower
asymptotes are at .30, .25, and .15 on the vertical axis,
respectively. The b-value for each ogive = 0. Note that the upper
asymptote for all 3 ogives is at 1.00.
5.12 Note also that the inflection points (all located at 0.0 on the
θ scale) for the ogives in Figure 5.11 are at different heights. In
fact, they are half-way between their asymptotes. That is always the
case. The inflection point of the logistic ogive is always half-way
between its upper and lower asymptotes.

5.13 The lower asymptote is called the c-parameter or the c-value. It
is another of the 3 parameters of IRT.

5.14 The effect of the c-value is to squeeze the ogive into a smaller
vertical range. The reduced range is equal to 1 - c. The effect of
the reduced vertical range is to reduce the slope of the ogive at every
point on the θ scale, other things being equal. We include the
c-parameter in the logistic function by multiplying by 1 - c, and adding
c:

$$\Psi(\theta) = c + (1 - c)\left[1 + e^{-1.7(\theta - b)}\right]^{-1}$$

which is the same as

$$\Psi(\theta) = c + (1 - c)\left[1 + \exp(-1.7(\theta - b))\right]^{-1}$$

and

$$\Psi(\theta) = c + \frac{1 - c}{1 + e^{-1.7(\theta - b)}}$$

The c-values of ogives H, J, and K in Figure 5.11 are .30, .25,
and .15, respectively.
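
The half-way rule of section 5.12 and the squeezing effect of section 5.14 can be checked with the following Python sketch. It uses the c-values stated for ogives H, J, and K; the far-left point θ = -10 is an arbitrary stand-in for "very low ability."

```python
import math

def logistic_ogive(theta, b=0.0, c=0.0):
    """Logistic ogive squeezed into the vertical range from c to 1."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * (theta - b)))

# c-values of ogives H, J, and K from Figure 5.11; all have b = 0.
c_values = {"H": 0.30, "J": 0.25, "K": 0.15}

for label, c in c_values.items():
    height_at_b = logistic_ogive(0.0, b=0.0, c=c)   # height of the inflection point
    far_left = logistic_ogive(-10.0, b=0.0, c=c)    # close to the lower asymptote
    print(label, height_at_b, (1.0 + c) / 2.0, round(far_left, 4))
```

The height at the b-value comes out as (1 + c)/2, half-way between c and 1.00, and the far-left height is essentially c.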
5.15 The third (and last) parameter of IRT is (you guessed it) the
a-parameter, or a-value.

5.16 The a-parameter is related to the slope of either ogive at
the inflection point, or in other words at the b-value. For the normal
ogive model (with c = 0.0)

$$a = \sqrt{2\pi}\, m \approx 2.5\, m$$

where m is the slope of the ogive at the b-value.
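
The relation between a and m can be checked numerically. The Python sketch below assumes the usual two-parameter normal ogive, the area under N(0,1) to the left of a(θ - b), which this section does not write out. It estimates the slope at the b-value by a small finite difference and multiplies by the square root of 2π. The parameter values and the step size are arbitrary.

```python
import math

def normal_ogive(theta, a, b):
    """Normal ogive with c = 0: area under N(0,1) to the left of a*(theta - b)."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))

def slope_at(theta, a, b, h=1e-5):
    """Central-difference estimate of the slope; the step h is an arbitrary choice."""
    return (normal_ogive(theta + h, a, b) - normal_ogive(theta - h, a, b)) / (2.0 * h)

a, b = 0.8, 0.0            # hypothetical parameter values
m = slope_at(b, a, b)      # slope of the ogive at the b-value
print(round(m, 4), round(math.sqrt(2.0 * math.pi) * m, 4))
```

The second printed number recovers the a-value that was put in, to within the accuracy of the finite difference.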

5.17 Figure 5.17 shows 3 logistic ogives (L, M, & N), which are identical
except for their a-values = .3, .8 and 2.0, respectively, with b = 0.0
and c = .00. As you can see, the larger the a-value, the steeper the
ogive. Specifically,

$$a = \left[\Psi^{-1}(\theta) - b\right]^{-1}$$

where $\Psi^{-1}(\theta)$ = the point on θ where the height of the
ogive = c + .8455(1 - c). The -1 that looks like an exponent of Ψ is
not an exponent at all, but indicates the inverse of the function.
Typically, a function is used by starting at some point on the
abscissa, going vertically to the function, and then horizontally to
the ordinate. The inverse procedure would be to start at a point on the
ordinate (in this case at c + .8455(1 - c)), go horizontally to the
function, and then drop down to the abscissa (θ). That point on θ is
$\Psi^{-1}(\theta)$. The -1 outside the brackets is an exponent, which
means to take the reciprocal. The number .8455 is the proportion of
area under the logistic f.f. and to the left of z-score = 1 (see
Figure 4.1). The z-score = 1 is an arbitrary but mathematically
convenient point.
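
The inverse procedure can be sketched in Python. For the logistic ogive the inverse has a closed form, obtained by solving the ogive for θ; the helper name `inverse_ogive` is my own. The sketch uses the values stated for ogive M (a = .8, b = 0, c = 0). In practice one would read θ off a plotted curve, so using a known a to build the curve here only demonstrates that the formula returns it.

```python
import math

def logistic_ogive(theta, a, b, c):
    """Three-parameter logistic ogive."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def inverse_ogive(height, a, b, c):
    """Theta at which the ogive reaches the given height (c < height < 1)."""
    p = (height - c) / (1.0 - c)
    return b + math.log(p / (1.0 - p)) / (1.7 * a)

a, b, c = 0.8, 0.0, 0.0                       # ogive M of Figure 5.17
target = c + 0.8455 * (1.0 - c)               # start on the ordinate
theta_star = inverse_ogive(target, a, b, c)   # drop down to the abscissa
recovered_a = 1.0 / (theta_star - b)          # reciprocal of the distance from b
print(round(theta_star, 3), round(recovered_a, 3))
```

The recovered value differs from .8 only in the fourth decimal place, because .8455 is itself a rounded number.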



5.18 The a-parameter enters the logi\stic function as part of the 
exponent of e. • 



-/.7a(0-b) 
I +e 



This formula is the 3-parameter logistic ogive. It will look 
rather ominous to the novice. However, it is not difficult with a 
pocket calculator with an eX key and a 1/x key. It is highly instru- 
ctive to go through the calculation of several points of a typical 
logistic ogive and to plot them. An opportunity to do so is provided 
below for an ogive with a =.9, b = -.4, and c = .2. The reader can 
verify the results in Figure 5.18, which shows this logistic ogive with 
its characteristic parts labeled. ' 
Pocket Calculator Instructions

a = .9
b = -.4
c = .2

$$\Psi(\theta) = c + \frac{1 - c}{1 + e^{-1.7a(\theta - b)}}$$

Enter    Key    Comment
θ        -      minus
b        x      times
a        x      times
-1.7            constant
         e^x    e^(-1.7a(θ-b))
         +      plus
1               constant
         1/x    reciprocal
         x      times
.8              1-c
         +      plus
.2              c
         =      Ψ(θ)

Record your Ψ(θ) here:

θ        Ψ(θ)
3
2.5
2
1.5
1        .916
.5       .839
0
-.5      .569
-1
-1.5
-2
-2.5
-3

Now plot Ψ(θ) vs. θ below.

[Blank plotting grid: Ψ(θ) from .00 to 1.00 on the vertical axis, θ from
-3 to +3 on the horizontal axis.]
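
Readers who prefer a computer to a pocket calculator can check their worksheet with the following Python sketch. It evaluates the same three-parameter logistic ogive at the thirteen θ values of the worksheet; the function name is my own.

```python
import math

def three_parameter_ogive(theta, a, b, c):
    """Three-parameter logistic ogive of section 5.18."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

a, b, c = 0.9, -0.4, 0.2
thetas = [3, 2.5, 2, 1.5, 1, 0.5, 0, -0.5, -1, -1.5, -2, -2.5, -3]

for theta in thetas:
    print(f"{theta:5.1f}  {three_parameter_ogive(theta, a, b, c):.3f}")
```

The printed values at θ = 1, .5, and -.5 agree with the three entries already filled in on the worksheet (.916, .839, and .569).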
[Figure 5.18. A three-parameter logistic ogive with a = .9, b = -.4,
and c = .2 with its characteristic parts labeled: the upper asymptote
at 1.00, the height c + .8455(1 - c), the inflection point at height
.5(1 + c), and the lower asymptote at c.]
CHAPTER 6
The Item Response Function (IRF)



6.1 Let's consider 2 examinees (Al and Bob) with different ability
levels, i.e. different θs. Let's say Al has a higher θ than Bob. That
means they are located at different places on the θ scale. See Figure
6.1.

Figure 6.1. The ability scale (θ) with two hypothetical
individuals (Al and Bob) located on it.
6.2 What are the chances that Al will get item #1 correct? What are
the chances that Bob will get item #1 correct? So far we don't know
the answer to either of those questions. But we do know one thing. Al
has a better chance of getting item #1 correct than Bob, because Al is
smarter than Bob (more ability θ). So let's represent the probability of
each getting the item correct by a point above each (points A & B) in
Figure 6.2.

Figure 6.2. The probabilities of Al and Bob getting
Item # 1 correct as a function of their abilities.

6.3 In doing so we have defined an ordinate as the probability of
getting the item correct as a function of θ (ability). This may be
written P_i(R|θ), and read, "the Probability of getting item i correct
given (|) θ." But for brevity it is usually written P_i(θ). The
subscript (i) is often omitted.

6.4 Now let's take Carl, who is dumber (less ability θ) than Bob.
Carl has an even smaller chance of getting the item correct. See
Figure 6.4a.

Figure 6.4a. The probabilities of Al, Bob, and Carl
getting Item # 1 correct.

And let's also add Dave, and Ed and Fred who have less θ still. See
Figure 6.4b.

Figure 6.4b. The probabilities of Al, Bob, Carl, Dave,
Ed, and Fred getting Item # 1 correct.

And we can add Olga, who is very bright. See Figure 6.4c.

Figure 6.4c. The probabilities of Al, Bob, Carl, Dave,
Ed, Fred, and Olga getting Item # 1 correct.




6.5 Since the probability of getting the item correct is only a
function of the amount of ability,* we can say that anyone who has
the same θ as Al will have the same probability as Al of getting
the item correct (A). And everyone who has the same θ as Ed will
have the same probability as Ed of getting the item correct (E),
and so on. Therefore, we can connect the points in Figure 6.4c,
which will tell us the P(θ) for each θ. This curve is called the Item
Response Function (IRF) and was until recently called the Item
Characteristic Curve (ICC). See Figure 6.5.

*assuming unidimensionality, which will be discussed in Section 14.4.

Figure 6.5. The Item Response Function of Item #1.

6.6 We know several things about this IRF.

(1) It cannot rise higher than 1.0, because a probability = 1.0
is a sure thing, and nothing can be more probable than a sure thing.

(2) It will never reach a height of 1.0, because in testing there
is no such thing as a sure thing. Therefore, the curve has an upper
asymptote of 1.00.

(3) Between Ed and Bob the curve has to rise rapidly, because it
must rise from point E to point B in the short distance between Ed's
θ and Bob's θ.

(4) The curve must always rise (i.e. can never be horizontal or
go down) as we move from left to right, because as ability increases,
so does the probability of getting the item correct. Therefore, the
curve is strictly monotonic.

(5) It cannot go below 0.00, because a probability = 0.00 is an
absolute impossibility, and nothing can be less probable than an
absolute impossibility. Therefore, the curve has a lower asymptote.

(6) Since the item is a multiple-choice question, there is
usually a fair probability of getting the item correct strictly by
chance alone, no matter how low the θ. Traditionally, we have taken
this probability to be 1/A, where A = the number of alternatives in the
multiple-choice question. A 4-choice item has been thought to have a
chance probability of 1/4 = .25, and a 5-choice item, a chance
probability of 1/5 = .20. Whatever the chance probability of getting
a multiple-choice item correct is, it is not expected to be zero.
It is expected to be somewhat greater than zero. Therefore, the curve
in Figure 6.5 is expected to have a lower asymptote above zero. (In
Section 7.3 we shall see that the lower asymptote is seldom 1/A.)

6.7 You have probably noticed that all of the things we observed about
the IRF are also true about the 3-parameter normal ogive and logistic
ogive.

Therefore, we conclude that the normal (or logistic) ogive may be
used to describe the IRF very well. And we may use the logistic ogive
function to describe the IRF mathematically.
6.8 If somehow we knew them and were to plot the probabilities of
getting item #2 correct for Al, Bob, Carl, Dave, Ed, Fred, and Olga, we
might get an IRF like Figure 6.8.

6.9 Figure 6.9 shows both item #1 and item #2.

Figure 6.9. The IRFs of Items #1 and #2.

For Olga, Ed, and Fred (and anyone else with their θs) the probability
(P_2(θ)) of getting item #2 correct is about the same as their P_1(θ)
for item #1.

But item #2 is harder for Al, Bob, Carl, and Dave than item #1,
because for all of them the probability of getting item #2 correct
is lower than the probability of getting item #1 correct. And
it would be harder for anyone who has the same ability as Al, Bob,
Carl, or Dave.

6.10 We also notice that the probabilities of getting item #2 correct
for Bob, Carl, Dave, Ed, and Fred are all about the same. Item #2,
then, does not do a good job in distinguishing among people with
abilities like Bob's or below. This observation is consistent with
what we intuitively understand about items. A hard item does not
discriminate among low ability people, because they all get it wrong
(unless they make a lucky guess). An easy item does not distinguish
among high ability people, because they all get it correct. A test
composed of items with IRFs like item #2's IRF would not be a good test
for measuring the relative ability of people like Bob, Carl, Dave, Ed,
and Fred.

6.11 However, Olga's P_2(θ) for item #2 is much higher than Al's
P_2(θ). Therefore, item #2 will distinguish between people like Al and
Olga. If a distinction in that range of ability is our purpose, then
a test made of items like #2 would be a pretty good test.

Note: In practice, any particular examinee may either know the answer
to a particular item (in which case his probability of getting it
correct is 1.00), or not know it (in which case his probability of
getting it correct is chance). Strictly speaking, we cannot talk about
the probability of a particular person getting a particular item
correct. However, for pedagogical reasons we will violate this
restriction in this section. (See Section 8.2 for clarification.)

6.12 Item #3 might have an IRF like that in Figure 6.12. The IRF of
this item rises over a longer range than does that of either item #1 or
item #2, but its slope is less at every point during its rise. This low
slope means that item #3 is discriminating over a wide range of θ, but
is not doing so well at any particular θ.

Figure 6.12. The IRF of Item #3.

6.13 Figure 6.13 shows the IRFs for both item #1 and item #3.

Figure 6.13. The IRFs of Items #1 and #3.

It is interesting to note that item #3 is harder than item #1 for
Al and Bob, but easier for Dave, Ed, and Fred. This possibility of
reversed relative item difficulty for persons of different ability is
one of the surprising results of IRT.

6.14 We have seen that the greater the slope of the IRF, the greater
the discrimination, but the smaller the range of discrimination. We
have already noted in Chapter 5 that the a-parameter of the logistic
ogive describes its slope. Therefore, the a-value is called the
discrimination index of the IRF. The greater the a-value of the IRF,
the better the item discriminates.

6.15 Also apparent is the fact that a shift of the IRF as a whole
to the left makes the item easier in general, and a shift to the right
makes the item harder in general. The left-right shift of the logistic
ogive is described by the b-parameter. Thus, the b-value is the
difficulty index of the IRF. The more difficult the item is, the larger
(in the positive direction) the b-value of the IRF.

6.16 The IRFs of items 1, 2, and 3 have different lower asymptotes.
Since the IRF never goes below the lower asymptote, this difference in
IRFs means that the items are of different difficulty even for
examinees of very low ability. But examinees of very low ability will
know almost nothing about the item, and therefore have to guess. The
difference in lower asymptotes of IRFs means that very low ability
examinees have a better chance of guessing the correct choice of some
items than of others. This result of IRT will be discussed further in
Section 7.3. The lower asymptote of the logistic ogive is the
c-parameter. The c-value of an IRF is called the "guessing index" or
more properly the "pseudo-guessing index" of the item. Both terms are
used.
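The roles of the three parameters can be illustrated numerically. The
sketch below is an illustration only: the four items and their
parameter values are hypothetical (they are not items #1, #2, and #3
of the figures), and the function name irf is an assumption. It shows
a steep and a flat item reversing their relative difficulty, a shift
in b changing difficulty, and c acting as the lower asymptote.

```python
import math

def irf(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical items, chosen only to illustrate the three parameters.
steep_item = {"a": 2.0, "b": 0.0, "c": 0.2}   # high a-value: steep IRF
flat_item = {"a": 0.5, "b": 0.0, "c": 0.2}    # low a-value: shallow IRF
hard_item = {"a": 1.0, "b": 1.0, "c": 0.2}    # IRF shifted to the right
easy_item = {"a": 1.0, "b": -1.0, "c": 0.2}   # IRF shifted to the left

print("theta   steep   flat    hard    easy")
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    row = [irf(theta, **item) for item in (steep_item, flat_item, hard_item, easy_item)]
    print(f"{theta:5.1f}  " + "  ".join(f"{p:.3f}" for p in row))
```

In this sketch the flat item is the harder of the two at θ = 1 but
the easier at θ = -1, which mirrors the reversal described in Section
6.13.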
Figure 6.17. The IRFs of four actual items from the
Coast Guard Knowledge section of the U.S. Coast
Guard Warrant Officer Test, series 8.

6.17 Figure 6.17 shows the IRFs for 4 actual items from the Coast
Guard Knowledge section of the U.S. Coast Guard Warrant Officer Test.

Item #17 is a very difficult, but highly discriminating item. It has a
c-value of .00, which means that nearly all examinees below θ = 1
answered the item incorrectly. Item #17 is a very unusual item in two
respects: its extremely high a-value, and its .00 c-value. It is,
however, an ideal item for many purposes.

Item #21 is an easy item with somewhat low discrimination. Item
#47 is slightly easier than #21, but has good discrimination. Item #50
is an item with medium difficulty and poor discrimination.

6.18 The IRF should not be confused with the item-test curve. The
item-test curve has raw score as the horizontal axis instead of θ.
The item-test curve, therefore, suffers from the same problems of
distorted scale as the raw score. The item-test curve has no
particular shape, and is not independent of the other items in the
test. In fact, the average of the item-test curves of all items in a
test is always a straight line of slope = 1 (i.e. 45°). Thus, for many
purposes the item-test curve is useless as an analytic tool.
CHAPTER 7
The a, b, & c parameters

7.1 The a-value is the discrimination index of the item. If θ is
normally distributed, in the normal ogive model the a-value is related
to the d-value in the following very complex way (from Schmidt, 1977):

$$a = \frac{d\sqrt{pq}}{\sqrt{(\text{KR-20})(1-c)^2y^2 - d^2pq}}$$

where d = d-value, the point biserial item-test correlation
p = p-value, the proportion of examinees correctly answering the item
q = 1-p
KR-20 = Kuder-Richardson formula 20 reliability
y = the height of the N(0,1) curve at the z score that cuts off
p' proportion of the area under the N(0,1) frequency function
c = c-value
p' = (p-c)/(1-c)



The a-value is related to the slope of the IRF, and can range from
0.0 to +∞ just as the slope can. Negative slopes are possible, but
not of interest to us. Experience has shown that a-values of typical
items vary from about .5 to 2.5 with most from 1.0 to 2.0. The highest
I have observed is 3.76. An item with a low a-value discriminates
poorly over a wide range of θ. With a high a-value the item
discriminates well, but over a small range of θ. Items with a-values
below .80 are not very good items for most purposes.

7.2 The b-value is the difficulty index. If θ is normally distributed,
it is related to the p-value in the normal ogive model (from Schmidt,
1977) in the following way:

$$b = \frac{yz(1-c)\sqrt{\text{KR-20}}}{d\sqrt{pq}}$$




where z = the z-score that cuts off p' proportion in the upper portion
of the area under the N(0,1) frequency function, and the other symbols
are as defined in Section 7.1 above. Typical b-values range from -2.5
to +2.5. A b-value of -2.5 indicates the item is very easy. An item
with a +2.5 b-value is very difficult, and items with 0.0 b-values are
of medium difficulty.
7.3 The c-value is the guessing parameter or pseudo-guessing
parameter. It indicates the probability of examinees with very low
ability getting the item correct. Most c-values range from .00 to
.40. Items with c-values of .30 or greater are not very good items.
It is desirable to have the c-value at .20 or less. The lower the
c-value is, the better. A zero c-value is ideal. Typically, the
c-value is about 1/A - .05, where A = the # of alternatives. Thus,
4-choice items often have c ≈ .20 (i.e. .25 - .05), and 5-choice items
often have c ≈ .15 (i.e. .20 - .05).

Items do not have a c-value of 1/A because examinees do not, in
fact, guess randomly when they do not know the answer (as has often
been assumed in classical test theory analyses).

7.4 Two explanations have been offered for the fact of non-random
guessing (c ≠ 1/A).

Lord has suggested that item writers are very clever in writing
distractors that are very attractive to low ability examinees. Thus,
when low θ examinees do not know the answer they are attracted more to
distractors than to the correct answer, and so get the item wrong more
often than if they guessed randomly.

The other explanation is my own, based upon personal knowledge of
item writing and test taking behavior:

(1) When an item writer sits down to write items, he, for the
moment, is not concerned with the distribution of the correct answers
(the keyed choices) among the four (for four-choice items) possible
positions (i.e. choice A, choice B, choice C, and choice D).

(2) He has a tendency to try to hide the correct choice. In a
four-choice item there are only 2 places to hide it: choice B or
choice C. Therefore, he writes many more items keyed B or C than A
or D, and in fact there seems to be a much stronger tendency toward C.
(I have verified this tendency with many item writers.) This also
seems to be true for 5-choice items.

(3) When he finishes writing the items, he tabulates the numbers
of items keyed for each position, and usually finds that he has many
more C's than A's, B's, or D's (or E's in 5-choice items).

(4) Most testing organizations have a requirement that there
should be about equal numbers of items with the keyed choice in each
of the 4 or 5 possible positions.

(5) The item writer then begins to revise the order of the
choices in items to decrease the number of items keyed C, and increase
the number of items keyed A and D and maybe B. He continues to revise
the order of the choices of items until he has satisfied the
requirement of about equal numbers of keyed choices in each position.

(6) Naturally, to save himself work and time (the Law of Least
Effort) he wants to revise as few items as possible. Therefore, he
stops revising items when he gets within the requirement of about
equal numbers. Because he started with more items keyed C, he also
ends up with more items keyed C (but not as many), because he only
needs about equal numbers.

If the above scenario is as universal as I believe, it means
that, in the set of all multiple-choice items in the world, more are
keyed C than any other choice. It is true of almost all of the tests I
have checked.
There is a widespread rule of thumb among examinees: "If you
don't know at all, guess C." I have heard this rule of thumb from
coast to coast, from high school and college students, and from
civilian employees and military personnel taking promotional tests.
I do not know the source of this rule of thumb, but it is possible
that the rule of thumb gradually grew from examinees' observations
of the frequency of keyed choice positions, as I have suggested
above.

Whatever the origin of the rule of thumb, it represents rational
behavior, given a higher frequency of choices keyed C among the
population of all multiple-choice items. By choosing choice C (when
you don't know at all), you will get more items correct by chance in
the long run than by guessing at random.

This analysis suggests that the c-values of items keyed C will
be higher than for items keyed A, B, and D. I was able to test this
hypothesis with 127 items from 6 forms of the verbal parts of the
SCAT-II series of tests, published by the Educational Testing
Service, Princeton, NJ. The c-values were provided by Fred Lord.
A two-by-two frequency table of A, B, D vs. C by above-average c-value
vs. below-average c-value yielded a chi square significant beyond the
.001 level. This result strongly supports the hypothesis that low
ability examinees get items keyed C correct more often than they get
items keyed A, B, or D correct.

The results suggest 2 alternative courses of action for testing
organizations.

(1) Require that there be exactly the same number of keys
in each position. This action would thwart the test-wiseness
of those who use the rule of thumb. However, it represents an
undesirable rigidity.

(2) A better course of action would be to key C for less than
1/4 of the items (for 4-choice items). This action would cause
a lower average c-value for the test. The lower average c-value
would increase the total information in the test, which as we
will see in Sec. 9.4 is highly desirable.

7.5 The Rasch model assumes that all items in a test have the same
a-value, and that c = .00 for all items. Both assumptions are nearly
always unrealistic.
CHAPTER 8
The Test Characteristic Curve

8.1 The scale of θ is continuous, but since most of the calculations
are done on digital computers, θ is usually broken into small,
discrete intervals of .05 θ units, and values of P(θ) are calculated
for each .05 interval from θ = -5.0 to θ = +5.0. The very broad range
from -5.0 to 5.0, and the small .05 intervals are used in the interest
of accuracy. Larger or smaller intervals and a broader or narrower
range may be used depending on the purpose and degree of accuracy
desired.

8.2 Table 8.2 below gives the P(θ) for 17 values of θ for each of the
4 items shown in Figure 6.17.
   θ      #17    #21    #47    #50    ΣP(θ)
 -2.7     .00    .30    .38    .20     .88
 -2.3     .00    .33    .40    .23     .96
 -2.0     .00    .37    .45    .25    1.07
 -1.7     .00    .43    .52    .28    1.23
 -1.3     .00    .53    .66    .33    1.52
 -1.0     .00    .71    .87    .44    2.02
  -.7     .00    [row illegible in source]
  -.3     .00    .82    .94    .52    2.28
   0      .00    .88    .97    .59    2.44
   .3     .00    .92    .99    .65    2.56
   .7     .00    .96    .99    .74    2.69
  1.0     .01    .97    .99    .79    2.75
  1.3     .04    .98    .99    .84    2.85
  1.7     .35    .99    .99    .89    3.22
  2.0     .78    .99    .99    .91    3.67
  2.3     .96    .99    .99    .94    3.88
  2.7     .99    .99    .99    .96    3.93

Table 8.2

An item is scored dichotomously, which means the examinee either
gets the item correct (for which he gets an observed score of 1) or
he gets the item wrong (for which he gets an observed score of 0).
The dichotomous score is a result of the typical use of
multiple-choice items. An examinee's dichotomous score (0 or 1) is not
a very accurate measure of his knowledge.

P(θ) may be interpreted in two ways. A P(θ) = .78 means both:

(1) 78% of the examinees with the given θ will get the
item correct, and

(2) An examinee will get correct 78% of the items for
which his P(θ) = .78.

If an examinee answers 100 questions for all of which his P(θ)
= .78, he is expected to get 78 items correct and 22 items wrong for a
% score of 78%. If there were some way to give him partial credit of
.78 points for each of the 100 items instead of 0 or 1 point, he would
also get a % score of 78%. This notion of partial credit for an item,
depending on his P(θ), leads to the idea of a true score on the item.

It is often not true that the examinee is 100% or 0% certain of
his answer. Yet on a multiple-choice item he either gets full (100%)
credit for the item (1, if he gets it correct) or no (0%) credit
(0, if he gets it wrong). The examinee's degree of certainty, if
measurable, could be taken as a more precise measure of his knowledge.
P(θ) might be interpreted as this measure of his knowledge, and is
called his true score on the item. The sum of his true item scores
is his true test score. His true test score is the raw score he
would get if there were no measurement error in the test.

The far right column in Table 8.2 is the sum of the P(θ)'s of the
4 items for each of the listed points on the θ scale. The ΣP(θ) is
the true test score of an examinee with a given θ on a test composed
of the 4 items.
[Figure 8.3 (plot not reproduced): Test Characteristic Curve, WO-8
CGK, items 17, 21, 47, 50. Vertical axis: true score ΣP(θ), from 0 to
4.00. Horizontal axis: θ.]

Figure 8.3. The Test Characteristic Curve of a test
composed of four real items.

8.3 If we plot the true test scores against θ, we get a test
characteristic curve (TCC). Figure 8.3 shows the TCC. The TCC
gives the true score for each point on the θ scale. Notice that
the TCC is neither a straight line nor an ogive. Each test will
have its own TCC, which is the sum of the IRFs of the items in
the test.
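The summation that produces the TCC can be checked directly against
Table 8.2. The following is a small sketch rather than part of the
original text; the names table_8_2 and true_test_score are
illustrative, and only five clearly legible rows of the table are
used.

```python
# P(theta) for items #17, #21, #47, #50 at selected theta values,
# copied from legible rows of Table 8.2.
table_8_2 = {
    -2.7: (0.00, 0.30, 0.38, 0.20),
    -2.0: (0.00, 0.37, 0.45, 0.25),
    0.0: (0.00, 0.88, 0.97, 0.59),
    2.3: (0.96, 0.99, 0.99, 0.94),
    2.7: (0.99, 0.99, 0.99, 0.96),
}

def true_test_score(item_probabilities):
    """True test score at one theta: the sum of the item true scores P(theta)."""
    return sum(item_probabilities)

# One point of the test characteristic curve per theta.
tcc = {theta: true_test_score(probs) for theta, probs in table_8_2.items()}

for theta in sorted(tcc):
    print(f"theta = {theta:5.1f}   true score = {tcc[theta]:.2f}")
```

The printed sums reproduce the right-hand column of Table 8.2 for
these rows, and they rise with θ, as a sum of rising IRFs must.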

8.4 One of the interesting uses of the TCC is to determine the
distribution of the true scores on the test. Figure 8.4 shows how
this is done. If the examinees' θs are normally distributed, as
shown on θ (upside down), the examinees' true scores will be as shown
on the left. The true score distribution is found by projecting the
intervals from the θ scale onto the TCC, and then representing the
same area on the true score scale within the projected intervals.
Figure 8.4 is an excellent demonstration of how the peculiarities of
a test produce a distorted metric.

8.5 It is important to note that true scores (T) are not observed
scores (X). Observed score is defined as true score plus error
(X = T + E). However, Lord (1969) has found that the distribution
of X will be similar to the distribution of T, but sometimes with
the high points of the true score distribution flattened somewhat,
and the low points higher. The flattening is due to error.

[Figure 8.4 (plot not reproduced): WO-8 CGK items 17, 21, 47, and 50.
Effect of test characteristic on distribution of true score. Axes:
true score and θ.]

Figure 8.4. An illustration of the use of the Test
Characteristic Curve to relate the distributions of θ
and True Score.
CHAPTER 9
The Item Information Function (IIF)

9.1 We can see in Figure 6.17a that item #17 will not help us to
distinguish among examinees whose θs are less than 1.0 because
they will all get the item wrong. Apparently, there is something
about item #17 that leads all examinees with θ < 1.0 to choose
the wrong alternative. This is an unusual situation, but it
actually occurs with this question. A test made exclusively of items
like #17 would do nothing to distinguish among examinees with θ <
1.0 because they would all get zero on the test. It would give us no
distinguishing information about them.

Item #17 also gives us no distinguishing information about
examinees with θ = 2.7 or greater because they will all get it
correct. On a test composed of items like #17, all examinees with
θ > 2.7 would get 100%.

Between θ = 1.0 and θ = 2.7, it is a different story. From θ = 1.0
to θ = 1.5, P(θ) goes from P(θ=1.0) = .00 to P(θ=1.5) = .08. The change
of P(θ) means that the item does help to distinguish among examinees
within the range of θ where the change of P(θ) occurs. In this case
the difference between the P(θ)'s (to be denoted dp) = .08 (.08 - .00)
is small. The change (dp) occurs over a range (dθ) of 1/2 θ unit
(1.5 - 1.0). The ratio of dp to dθ (dp/dθ) is equal to the average
slope of the IRF over the range of dθ. For the range from θ = 1.0 to
θ = 1.5, dp/dθ = .08/.5 = .16.

From θ = 1.5 to θ = 2.0 for item #17, P(θ) changes from .08 to
.78, a very large change. dp = .70 (.78 - .08) in this range, and
dp/dθ = .70/.5 = 1.40, which is very large. Item #17 is an excellent
item for distinguishing among examinees in the range θ = 1.5 to θ =
2.0. A test composed of items like #17 would give scores from about
8% to 78% for examinees whose θs go from 1.5 to 2.0. This test
would give us a lot of distinguishing information about examinees in
this range of θ, because it would spread them out over a wide range
of test scores.

We can see that the greater the slope of the IRF, the more
information the item gives us about examinees in the range being
considered.

9.2 If we could make the range of θ over which we find the slope
smaller and smaller, we would eventually get to the slope of the IRF
at a point, which would be the slope of the tangent line to the IRF at
a particular point of θ.

The slope of the IRF would be a measure of the relative amount
of information the item gives about examinees at that point. The
greater the slope, the more information.

Fortunately, there is an easy way to find the slope of the
logistic ogive. The slope of the IRF is given by:

$$\frac{dp}{d\theta} = \frac{1.7a(1-c)\,e^{-1.7a(\theta-b)}}{\left[1+e^{-1.7a(\theta-b)}\right]^2}$$

where a, b, and c are the item parameters and θ is the point
where dp/dθ is the slope. The slope is also sometimes denoted as
P'(θ), or P' for short. In calculus P'(θ) is known as the first
derivative of P(θ). Since the slope (P') is a measure of information,
it is possible to plot a curve that shows the amount of information
an item gives at each point on the θ scale.
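The passage from average slope to tangent slope can be made concrete.
The sketch below is illustrative only: the function names are
assumptions, the item parameters a = 1.2, b = .5, c = .2 are
hypothetical, and the slope function is the standard derivative of
the three-parameter logistic ogive. It first repeats the two average
slopes worked out in Section 9.1 for item #17, then shrinks the
interval around one point.

```python
import math

def irf(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def irf_slope(theta, a, b, c, D=1.7):
    """Slope P'(theta) of the three-parameter logistic ogive."""
    e = math.exp(-D * a * (theta - b))
    return D * a * (1.0 - c) * e / (1.0 + e) ** 2

def average_slope(p_low, p_high, theta_low, theta_high):
    """dp/dtheta over an interval: change in P divided by change in theta."""
    return (p_high - p_low) / (theta_high - theta_low)

# The two intervals discussed in Section 9.1 for item #17.
slope_low_range = average_slope(0.00, 0.08, 1.0, 1.5)
slope_high_range = average_slope(0.08, 0.78, 1.5, 2.0)
print(slope_low_range, slope_high_range)

# Hypothetical item: shrinking the interval approaches the tangent slope.
a, b, c = 1.2, 0.5, 0.2
theta = 0.5
tangent = irf_slope(theta, a, b, c)
approximations = {}
for width in (0.5, 0.1, 0.001):
    lo, hi = theta - width / 2, theta + width / 2
    approximations[width] = average_slope(irf(lo, a, b, c), irf(hi, a, b, c), lo, hi)
    print(f"width = {width:6.3f}   average slope = {approximations[width]:.6f}")
print(f"tangent slope = {tangent:.6f}")
```

At θ = b the exponential term equals 1, so the tangent slope reduces
to 1.7a(1-c)/4, which is .408 for the hypothetical item used here.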

9.3 However, there is a catch. For mathematical and statistical
reasons which we will not go into, P'(θ) is not a completely
appropriate measure of information, but a related function is.
The function is:

$$I(\theta,u) = \frac{P'^2}{P(\theta)\,Q(\theta)} = \frac{(1.7a)^2(1-c)}{\left[c+e^{1.7a(\theta-b)}\right]\left[1+e^{-1.7a(\theta-b)}\right]^2}$$

where P'^2 is P' squared, and Q(θ) = 1 - P(θ). Note that the
exponent of the left e in the denominator is positive, and the
exponent of the right e is negative.

Figure 9.4a. The Item Information Functions of four
real items.

That function is called the Item Information Function (IIF), and
is written I(θ, u). The above formula for I(θ, u) may look even more
ominous than the formula for P(θ), but in fact it is only slightly
more complicated. It is still feasible to calculate points of
I(θ, u) with a typical scientific hand calculator.
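For readers who prefer a computer to a hand calculator, here is a
minimal sketch of the same calculation. It is not from the original
primer: the function names and the parameter values a = 1.2, b = .5,
c = .2 are hypothetical. It computes I(θ, u) two ways, from the
closed form with the two exponentials and from P' squared divided by
P times Q, and the two should agree.

```python
import math

def irf(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function P(theta)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def irf_slope(theta, a, b, c, D=1.7):
    """Slope P'(theta) of the three-parameter logistic ogive."""
    e = math.exp(-D * a * (theta - b))
    return D * a * (1.0 - c) * e / (1.0 + e) ** 2

def item_information(theta, a, b, c, D=1.7):
    """I(theta, u) in closed form: left exponent positive, right exponent negative."""
    L = D * a * (theta - b)
    return (D * a) ** 2 * (1.0 - c) / ((c + math.exp(L)) * (1.0 + math.exp(-L)) ** 2)

def item_information_from_slope(theta, a, b, c, D=1.7):
    """I(theta, u) as P' squared divided by P times Q, with Q = 1 - P."""
    p = irf(theta, a, b, c, D)
    return irf_slope(theta, a, b, c, D) ** 2 / (p * (1.0 - p))

# Hypothetical item parameters, for illustration only.
a, b, c = 1.2, 0.5, 0.2
for theta in (-1.0, 0.0, 0.5, 1.0, 2.0):
    closed = item_information(theta, a, b, c)
    ratio = item_information_from_slope(theta, a, b, c)
    print(f"theta = {theta:4.1f}   closed form = {closed:.4f}   P'^2/(PQ) = {ratio:.4f}")
```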


9.4 Figure 9.4a shows the I(θ, u) for the four items whose IRFs are
shown in Figure 6.17. (Note that the vertical scale for item #17 is
different from the others.) In comparing the IRFs with the IIFs,
you will note three important relationships.

(1) The IIF is highest close to where the slope of the IRF is
steepest.

(2) The total area under the IIF increases as the a-value
increases.

(3) The total area under the IIF decreases as the c-value
increases.

The fact that total information (i.e. total area under the IIF)
increases as the a-value increases demonstrates the importance of
high a-values for items. However, there is another effect of high
a-values. As the a-value increases, the width of the θ scale over
which the information is distributed decreases. The effect is called
the bandwidth paradox*. Thus, sometimes a compromise must be made
between the total information provided by the item and the
distribution of information over θ.

*This bandwidth paradox is different from the bandwidth paradox
described by Cronbach (1960, p. 602).

The total information (A_g) of item g is given by

$$A_g = \frac{1.7a\,(1 - c + c\log c)}{1-c} = 1.7a\left(1 + \frac{c\log c}{1-c}\right)$$

where a and c are the item parameters and log c is the natural
logarithm of c. From inspection of the formula for A_g, you can see
that as the a-value increases, so does A_g. Also apparent is the fact
that, as c approaches zero, A_g approaches 1.7a. Therefore, the maximum
total information an item can provide is 1.7a. Not so obvious from
the formula for A_g is the relation that, as c approaches 1.00, A_g
approaches zero. This occurs because log c is negative for every c
below 1, and because, as c approaches 1, c log c/(1-c) approaches -1.
This relation explains the effect of the c-value: the c-value destroys
information. Figure 9.4b shows how total information decreases as c
increases while holding the a-value constant.

Since the b-value is not included in the formula for A_g, the b-value
does not affect the total information.
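The total-information relation can be checked numerically. The sketch
below rests on stated assumptions: it takes the total information to
be 1.7a[1 + c log c/(1-c)], which is what integrating the
three-parameter logistic information function over θ yields, and
compares it with a simple trapezoid-rule area under the IIF. All
function names and parameter values are illustrative.

```python
import math

def item_information(theta, a, b, c, D=1.7):
    """Item information function of the three-parameter logistic model."""
    L = D * a * (theta - b)
    return (D * a) ** 2 * (1.0 - c) / ((c + math.exp(L)) * (1.0 + math.exp(-L)) ** 2)

def total_information_closed_form(a, c, D=1.7):
    """Total area under the IIF: D*a*(1 + c*log(c)/(1 - c)); D*a when c = 0."""
    if c == 0.0:
        return D * a
    return D * a * (1.0 + c * math.log(c) / (1.0 - c))

def total_information_numeric(a, b, c, half_width=20.0, step=0.01):
    """Trapezoid-rule area under the IIF on [b - half_width, b + half_width]."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        theta = b - half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * item_information(theta, a, b, c)
    return total * step

# Hypothetical (a, c) pairs; b is varied to show it has no effect.
for a, c in ((1.0, 0.0), (1.0, 0.2), (2.0, 0.25)):
    closed = total_information_closed_form(a, c)
    numeric = total_information_numeric(a, b=0.0, c=c)
    shifted = total_information_numeric(a, b=1.5, c=c)
    print(f"a = {a}, c = {c}: closed = {closed:.4f}, numeric = {numeric:.4f}, b shifted = {shifted:.4f}")
```

In this sketch the area shrinks as c grows and is unchanged when b is
moved, in line with the three statements made in the text.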

9.5 The point on θ where the IIF is highest is not at the b-value,
as one might expect (except when c = 0). The point on θ where
information is greatest is given by

$$\theta_{\max} = b + \frac{1}{1.7a}\log\left[\frac{1+\sqrt{1+8c}}{2}\right]$$

where "log" means the natural logarithm.

The point on θ where information is maximized is always to the
right of the b-value (except when c = 0, when it is at the b-value),
but never farther to the right than .41/a.
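The location of maximum information can also be found by brute force
and compared with a closed form. The sketch below assumes the
standard three-parameter logistic result b + log[(1 + sqrt(1 + 8c))/2]
/ (1.7a) for the location of the maximum; the function names, the
grid limits, and the item parameters are hypothetical choices made
for illustration.

```python
import math

def item_information(theta, a, b, c, D=1.7):
    """Item information function of the three-parameter logistic model."""
    L = D * a * (theta - b)
    return (D * a) ** 2 * (1.0 - c) / ((c + math.exp(L)) * (1.0 + math.exp(-L)) ** 2)

def theta_of_max_information(a, b, c, D=1.7):
    """Closed-form location of the IIF maximum (assumed 3PL result)."""
    return b + math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)

def grid_argmax(a, b, c, lo=-5.0, hi=5.0, step=0.001):
    """Brute-force search for the theta with the largest information."""
    best_theta, best_info = lo, -1.0
    n = int(round((hi - lo) / step))
    for i in range(n + 1):
        theta = lo + i * step
        info = item_information(theta, a, b, c)
        if info > best_info:
            best_theta, best_info = theta, info
    return best_theta

# Hypothetical item parameters, for illustration only.
a, b, c = 1.2, 0.3, 0.2
closed = theta_of_max_information(a, b, c)
searched = grid_argmax(a, b, c)
print(f"closed form: {closed:.4f}   grid search: {searched:.4f}   b: {b}")
```

For c near 1 the logarithm approaches log 2 = .693, and .693/1.7 is
about .41, which agrees with the .41/a limit stated in the text.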

9.6 The IIF is symmetrical when c = 0 and skewed to the right when
c ≠ 0. The larger c is, the greater the right-skew. The right-skew
occurs because the c-value destroys more information at low levels
of θ than at high levels. This result makes sense because examinees
at low θs will guess more than examinees at high θs. Guessing (i.e.
the opportunity to get the item correct by guessing) destroys
information. It is for this reason that five-choice items are
preferred to four-choice items.
CHAPTER 10
The Test Information Curve and Relative Efficiency Curve

10.1 The Test Information Curve (TIC) is nothing more than the sum of
the IIFs. IIFs are summed by "stacking them on top of each other."
"Stacking" IIFs merely means that the heights (i.e. the amount of
information) of the IIFs at a particular value of θ are added together
to give the height of the TIC at that value of θ. Plotting the sum of
item information at each value of θ gives the TIC. The height of the
TIC at θ is written as I(θ).

10.2 Figure 10.2a shows the sum of the IIFs for items #17 and #21 as
shown in Figure 9.4a. Figure 10.2b shows the IIF of item #47 added to
Figure 10.2a. Figure 10.2c shows the IIF of item #50 added to the
other 3 items. A test composed of these four items would have the
weird TIC in Figure 10.2c.

Figure 10.2 a, b, and c. The Test Information
Curve of (10.2a) a test composed of items #17
and #21, (10.2b) a test composed of items #17,
#21, and #47, and (10.2c) a test composed of
items #17, #21, #47, and #50 from the USCG
Warrant Officer Test.

10.3 The TIC shows the relative amounts of information provided by
the test at each point on θ. Where you want information depends on
what you will use the test for. If you want to select a few examinees
from a large number, then you want a lot of information at high levels
of θ, so that you can tell just which examinees are the best. For
example, see Figure 10.3a. If you want to select all examinees except
a few, then you want a lot of information at low θs so you can tell
which examinees are the worst (e.g. see Figure 10.3b).
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Figure 10.3c. The Test Information Curve of a
hypothetical test, which would be efficient at both high
and low cut-scores.
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Sometimes a test is designed for more than one purpose, such as
to be used with two cut scores for entrance into two different
schools. In this case a two-humped TIC will give good information at
the two cut scores (e.g. see Figure 10.3c).

A TIC of any desired shape may be constructed, provided the
items with the necessary IIFs are available to construct the TIC.

10.4 Usually we already have a test and want to revise it to make it
better serve our purpose. A comparison of the new and old versions
should be made using the Relative Efficiency Curve (REC). The REC is
nothing more than the ratio of the TICs. The ratio of the two curves
is found by dividing the I(θ) of one test by the I(θ) of the other
test at each point on θ. Figure 10.4 is the REC, comparing the TIC
in Figure 10.3c to the TIC in Figure 10.3b.

Where the REC is above 1.0, the test in Figure 10.3c (the test
for which the I(θ) is the numerator of the REC ratio) is better than
the test for Figure 10.3b. Where the REC is below 1.0, the test for
Figure 10.3b is better. And where the REC = 1.0, the two tests are
the same.
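
As a rough sketch of the ratio just described, the Python below divides one set of TIC heights by another at each θ. The TIC heights are invented for illustration and are not read from Figures 10.3b or 10.3c; the function name is also just a label chosen here.

```python
def relative_efficiency(tic_numerator, tic_denominator):
    """Divide one test's I(theta) by the other's at each theta point."""
    return [n / d for n, d in zip(tic_numerator, tic_denominator)]


thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
tic_new = [1.5, 2.0, 1.5, 3.0, 1.0]  # hypothetical I(theta), numerator test
tic_old = [1.0, 2.0, 3.0, 1.5, 0.5]  # hypothetical I(theta), denominator test

rec = relative_efficiency(tic_new, tic_old)

for theta, ratio in zip(thetas, rec):
    if ratio > 1.0:
        verdict = "numerator test is better here"
    elif ratio < 1.0:
        verdict = "denominator test is better here"
    else:
        verdict = "the two tests are the same here"
    print(f"theta={theta:+.1f}  REC={ratio:.2f}  {verdict}")
```

Reading down the output shows where on the θ scale a revision helped and where it hurt, which is the comparison the REC is meant to support.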

By starting with an old test, making substitutions of items, and
calculating the REC, you can experiment with and improve the old test
by trial and error. It does not take long to develop some skill in
replacing items to improve the TIC as desired.

10.5 Every test has some error in it. The Standard Error of Estimate
(S.E.E.) is the expected standard deviation of errors of estimated
ability. That is, if we were to give a test to a group of examinees
with identical θs, and estimate their θs with the test, the standard
deviation of those estimates would be the S.E.E.
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10.6 If the estimate of θ is a maximum likelihood estimate (see Chapter 12),
the S.E.E. at a particular θ is easy to calculate from the TIC. The S.E.E.
is equal to the square root of the reciprocal of the height of the TIC (I(θ)):

S.E.E. = √(1 / I(θ))

Since I(θ) varies along the θ scale, so will the S.E.E. The
larger I(θ) is, the smaller the S.E.E. A small S.E.E. at a cut point is
highly desirable.
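
The arithmetic in Section 10.6 is short enough to show directly. In the sketch below the TIC heights are hypothetical, chosen only so the square roots come out evenly.

```python
import math


def standard_error(information):
    """S.E.E. as the square root of the reciprocal of I(theta)."""
    return math.sqrt(1.0 / information)


# Hypothetical TIC heights at five points on the theta scale.
tic_heights = {-2.0: 1.0, -1.0: 4.0, 0.0: 25.0, 1.0: 4.0, 2.0: 1.0}

for theta, info in tic_heights.items():
    print(f"theta={theta:+.1f}  I(theta)={info:5.1f}  S.E.E.={standard_error(info):.2f}")

# The theta with the most information has the smallest S.E.E.
best_theta = min(tic_heights, key=lambda t: standard_error(tic_heights[t]))
print("smallest S.E.E. at theta =", best_theta)
```

With these made-up heights the S.E.E. falls from 1.00 at the ends of the scale to 0.20 where the information is 25.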

10.7 The average S.E.E. over examinees is related to the
reliability of Classical Test Theory (rₜₜ), when the scores are
standardized to a standard deviation = 1.0.



This relation implies that a test with high reliability may be a
poor test for your purposes because it has low information at the
critical values of θ. Similarly, a test with low reliability may be an
excellent test for some purposes, if it has high information where it
is needed. Thus, reliability is highly misleading as to the value of a
test.

The relation also makes clear the dependence of reliability on the
distribution of ability. If many examinees are on the θ scale where
there is high information, then the reliability will be higher than if
they are distributed on θ at points where information is low.



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CHAPTER 11
The Score Information Curve



11.1 The test information curve (I(θ)) gives the maximum amount of
information about θ that can be extracted from the test. However, to
get the maximum information, items must be optimally weighted. The
optimal weight (W(θ)) of an item is given by



W(θ) = 1.7a / (1 + c exp(-1.7a(θ - b)))



There is a curious characteristic of W(θ). It varies with θ.
That means that item A should receive different weights for examinees
with different θs. But to get W(θ), you must know θ, which is what
you are trying to get by giving the test.
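
To see how an optimal weight changes along the θ scale, here is a minimal sketch. It assumes the three-parameter logistic IRF with scaling constant 1.7, for which the general optimal weight P'/(PQ) reduces to 1.7a / (1 + c exp(-1.7a(θ - b))); that form is an assumption of this example, and the item parameters are hypothetical.

```python
import math

D = 1.7  # logistic scaling constant (assumed)


def optimal_weight(theta, a, b, c):
    """Optimal item weight at theta under the 3PL logistic model (assumed form)."""
    return D * a / (1.0 + c * math.exp(-D * a * (theta - b)))


# Hypothetical item: a = 1.0, b = 0.0, with and without guessing.
for k in range(-4, 5):
    theta = k / 2
    w_guess = optimal_weight(theta, 1.0, 0.0, 0.2)
    w_no_guess = optimal_weight(theta, 1.0, 0.0, 0.0)
    print(f"theta={theta:+.1f}  W (c=.2)={w_guess:.3f}  W (c=0)={w_no_guess:.3f}")
```

Under these assumptions the weight of an item with c > 0 shrinks toward zero for low-θ examinees, where a correct answer is most likely to be a guess, and approaches 1.7a at high θ. With c = 0 the weight is the same at every θ.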

11.2 There are two ways to approach this dilemma.

(1) The most satisfactory way is to use an iterative computer
program, such as LOGIST or OGIVIA (see Chap. 15). These computer
programs, in effect, make use of the optimal item weights and
hence yield maximum information about θ.

(2) A rough approximation would be to take raw scores on the
test, divide the distribution of raw scores into, say, top,
middle and bottom groups and then rescore using different
item weights for each group. This procedure would not yield
maximum information, but would provide more information than
not using variable item weights at all.



11.3 If neither of the options in Section 11.2 is possible, then you
may have to resort to the use of number-right score. In this case
the amount of information provided by this scoring procedure becomes
of interest. The amount of information provided by a number-right
score is called the number-right Score Information Curve (SIC). The
formula for the SIC (also written as I(θ, X)) is



11.4 The SIC usually has the same general shape as the TIC, but is
lower than the TIC at all values of θ. At high θ the TIC and SIC will
be nearly the same height (i.e. SIC/TIC ≈ 1.0). As θ becomes smaller
and smaller, SIC/TIC becomes smaller. This result means that, at high
θs little information is lost by using a number-right score, but at low
θs relatively much information is lost. Such is the penalty for use of
the inefficient number-right score.






11.5 The SICs of two tests may be used just as the TICs are used. A rough
approximation of the standard error of estimate may be found for each θ using
the number-right scoring procedure, and the ratio of the SICs of two
number-right scored tests may be interpreted in the same manner as the Relative
Efficiency Curve for TICs. (Strictly speaking, for this interpretation
to be legitimate, the test score must be shown to be an unbiased
estimate of θ.)

11.6 The SIC is plotted by a computer program available from the Educational
Testing Service (see Chapter 15), and may be derived from a program by John
Gugel (see Section 15.4).






CHAPTER 12 

Maximum Likelihood Estimation of θ

12.1 There are two main ways in IRT to estimate an examinee's θ.
They are called the Maximum Likelihood Estimation method and the
Bayesian Modal Estimation method. Both methods use the actual
response pattern of the examinee rather than the raw score. The
difference between the two methods is merely an additional assumption made by the
Bayesian method.

12.2 A response is indicated by the lower case letter u. If the examinee
gets item i correct, then uᵢ = 1, and if he gets it wrong, then uᵢ = 0. A
response pattern is also called a response vector, and is represented by
the uppercase letter U. A response pattern is a list of zeroes and ones,
indicating which questions the examinee got correct or wrong in the order
the items appear in the test. For example, in a four-item test, an
examinee who got the first two items correct and the last two wrong would have
a response pattern U = 1100. If he got the first and third items correct
and the other two items wrong, his response pattern would be U = 1010. If
he got the first three wrong and the last item correct, he would have a
response pattern U = 0001.

12.3 We recall that Pᵢ(θ) is the probability that an examinee with
ability θ will get item i correct. Qᵢ(θ) is the probability that an
examinee with ability θ will get item i wrong. Qᵢ(θ) = 1 - Pᵢ(θ). We'll
abbreviate Pᵢ(θ) and Qᵢ(θ) by Pᵢ and Qᵢ.
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12.4 Probability theory tells us that the probability of independent
events occurring together is equal to the product of their separate
probabilities. We know that the probability of getting one item
correct or wrong is independent of the probability of getting other
items correct or wrong for any given value of θ. We know this because
of the assumption of local independence.*



12.5 Therefore, the probability of an examinee getting item 1 correct
and item 2 wrong is P₁Q₂. The probability of getting both items wrong
is Q₁Q₂. Getting item 1 correct and item 2 wrong is the response
pattern U=10. Therefore, P(U=10)=P₁Q₂, P(U=00)=Q₁Q₂, P(U=01)=Q₁P₂,
and P(U=11)=P₁P₂.

Similarly, for three items for a given θ, if:

P₁ = .3     Q₁ = .7
P₂ = .6     Q₂ = .4
P₃ = .8     Q₃ = .2

*The assumption of local independence will be discussed in Sec. 14.3.
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then

 U      L(U|θ) = Likelihood = Π Pᵢ^uᵢ Qᵢ^(1-uᵢ)

000     Q₁Q₂Q₃ = .7 x .4 x .2 = .056
001     Q₁Q₂P₃ = .7 x .4 x .8 = .224
010     Q₁P₂Q₃ = .7 x .6 x .2 = .084
100     P₁Q₂Q₃ = .3 x .4 x .2 = .024
011     Q₁P₂P₃ = .7 x .6 x .8 = .336
101     P₁Q₂P₃ = .3 x .4 x .8 = .096
110     P₁P₂Q₃ = .3 x .6 x .2 = .036
111     P₁P₂P₃ = .3 x .6 x .8 = .144

Table 12.5

The likelihood of each possible response pattern for a
given θ where the Pᵢ(θ) is as given in Section 12.5.
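
The entries of Table 12.5 can be reproduced with a short Python sketch. It uses only the three P values given in Section 12.5; the function and variable names are just labels chosen for this example.

```python
from itertools import product

# P values for the three items at the given theta (Section 12.5).
p = [0.3, 0.6, 0.8]


def likelihood(pattern, p):
    """Multiply P for each correct item and Q = 1 - P for each wrong item."""
    result = 1.0
    for u, p_i in zip(pattern, p):
        result *= p_i if u == 1 else (1.0 - p_i)
    return result


# All 2 x 2 x 2 = 8 possible response patterns for three items.
table = {pattern: likelihood(pattern, p) for pattern in product((0, 1), repeat=3)}

for pattern, value in table.items():
    label = "".join(str(u) for u in pattern)
    print(f"U = {label}   L(U|theta) = {value:.3f}")

print(f"sum over all patterns = {sum(table.values()):.3f}")
```

Because the eight patterns are the only possible outcomes at this θ, their likelihoods add up to 1.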

12.6 These probabilities are called likelihoods (and written L(U|θ)).

Each likelihood is the conditional probability of a response
pattern (U) given θ, i.e. L(U|θ). The general formula for a
likelihood is

L(U|θ) = Π (i = 1 to n) Pᵢ^uᵢ Qᵢ^(1-uᵢ)

The upper case Greek letter Π means the product of the Pᵢ^uᵢ Qᵢ^(1-uᵢ)
where i goes from 1 to n (n = the # of items in the test), just as
in statistical notation Σ means the sum of a series of numbers
where i goes from 1 to n.

When uᵢ = 1,   Pᵢ^uᵢ Qᵢ^(1-uᵢ) = Pᵢ¹ Qᵢ⁰ = Pᵢ

When uᵢ = 0,   Pᵢ^uᵢ Qᵢ^(1-uᵢ) = Pᵢ⁰ Qᵢ¹ = Qᵢ

When uᵢ = 1, the Qᵢ drops out, and when uᵢ = 0, the Pᵢ drops out.

Thus, Pᵢ^uᵢ Qᵢ^(1-uᵢ) is just a convenient mathematical way of getting rid of
the P or Q depending on the value of uᵢ. For a three-item test the
likelihood of U = 011 is L(U|θ) = Q₁P₂P₃.
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          Item #1      Item #2      Item #3
   θ      P₁   Q₁      P₂   Q₂      P₃   Q₃      Q₁  x  P₂  x  Q₃  = L(U|θ)   L(θ|U)

 -3.0    .29  .71     .36  .64     .21  .79     .71 x .36 x .79 =  .202      .169
 -2.5    .32  .68     .39  .61     .22  .78     .68 x .39 x .78 =  .207      .173
 -2.0    .37  .63     .45  .55     .25  .75     .63 x .45 x .75 =  .213      .178
 -1.5    .50  .50     .60  .40     .30  .70     .50 x .60 x .70 =  .210      .176
 -1.0    .62  .38     .77  .23     .38  .62     .38 x .77 x .62 =  .181      .151
 -0.5    .77  .23     .90  .10     .50  .50     .23 x .90 x .50 =  .104      .087
  0.0    .88  .12     .97  .03     .59  .41     .12 x .97 x .41 =  .048      .040
  0.5    .93  .07     .99  .01     .70  .30     .07 x .99 x .30 =  .021      .018
  1.0    .97  .03     .99  .01     .79  .21     .03 x .99 x .21 =  .006      .005
  1.5    .98  .02     .99  .01     .87  .13     .02 x .99 x .13 =  .003      .003
  2.0    .99  .01     .99  .01     .91  .09     .01 x .99 x .09 =  .000      .000
  2.5    .99  .01     .99  .01     .95  .05     .01 x .99 x .05 =  .000      .000

                                                      ΣL(U|θ) = 1.195     1.000

Table 12.7

The method of calculating the Maximum Likelihood
Estimate of θ from a test of 3 items for an examinee
with the response pattern, U = 010.
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12.7 When we give a test, we get each examinee's response pattern,
and we want his θ. L(U|θ) is not what we want, since we already have
U. What would help us estimate an examinee's θ is just the reverse,
i.e. L(θ|U).

Fortunately, Bayes' Theorem allows us to get L(θ|U) from L(U|θ).

L(θ|U) = L(U|θ) / ΣL(U|θ)

To use Bayes' Theorem we have to get the L(U|θ) at several points on
the θ scale. How many points we use is determined by how accurately
we want to estimate θ.

To show how this is done, L(U=010|θ) is calculated in Table 12.7
for three hypothetical items at 12 values of θ.

The total of the L(U|θ)s is ΣL(U|θ). The right column shows
L(θ|U) = L(U|θ)/ΣL(U|θ). Any examinee, no matter what his θ, could
conceivably have a U = 010 in this three-item test. There is a finite
probability of U = 010 at every θ.

However, the likelihood of an examinee having U = 010 varies
considerably with θ. An examinee with θ ≥ 0.0 is unlikely to have
U = 010. In fact, only 6% of examinees with θ ≥ 0.0 will have U = 010.

Note: The proponents of Maximum Likelihood Estimation do not agree with
the use of Bayes' Theorem in this explanation.



A graph of the likelihoods (for U = 010) would look like Figure 12.7.

Figure 12.7. The graph of the likelihoods in Table
12.7, called the likelihood function.

This curve is called the likelihood function.

If you had to guess the θ of an examinee with U = 010, what θ
would you guess from the information in Table 12.7? You should guess
his θ = -2.0 because the likelihood of U = 010 is greater at θ = -2.0
than at any other θ. Therefore, you would be right more often than if
you guessed any other θ. By choosing the θ with the greatest
likelihood, you have chosen the θ with the maximum likelihood. And that is
the Maximum Likelihood method of estimating θ! That's all there is to
it.
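
The calculation in Table 12.7 can be sketched in Python as follows. The P values are those of the table (with Q taken as 1 - P); the variable and function names are labels chosen for this example only.

```python
# Theta points and P values for the three hypothetical items of Table 12.7.
thetas = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
p_item = [
    [.29, .32, .37, .50, .62, .77, .88, .93, .97, .98, .99, .99],  # item #1
    [.36, .39, .45, .60, .77, .90, .97, .99, .99, .99, .99, .99],  # item #2
    [.21, .22, .25, .30, .38, .50, .59, .70, .79, .87, .91, .95],  # item #3
]

pattern = (0, 1, 0)  # U = 010


def likelihood(pattern, k):
    """L(U|theta) at the k-th theta point: product of P (right) or Q (wrong)."""
    value = 1.0
    for u, p_row in zip(pattern, p_item):
        value *= p_row[k] if u == 1 else (1.0 - p_row[k])
    return value


likelihoods = [likelihood(pattern, k) for k in range(len(thetas))]
total = sum(likelihoods)
posterior = [l / total for l in likelihoods]  # the L(theta|U) column

best = max(range(len(thetas)), key=lambda k: likelihoods[k])
mle = thetas[best]

for theta, l, post in zip(thetas, likelihoods, posterior):
    print(f"theta={theta:+.1f}  L(U|theta)={l:.3f}  L(theta|U)={post:.3f}")
print("maximum likelihood estimate of theta:", mle)
```

The last line picks out the θ whose likelihood is greatest, which is all the Maximum Likelihood method does on a grid of θ values.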
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Now look at the L(U|θ) column. At which value of θ is L(U|θ)
greatest? It is at θ = -2.0, the same as the θ with the maximum
L(θ|U). That will always be the case because the L(θ|U)s are just the
L(U|θ)s divided by the constant ΣL(U|θ). So the θ with the maximum
L(θ|U) will always be the same as the θ with the maximum L(U|θ).
Therefore, it is not necessary to divide by ΣL(U|θ) in order to find
the θ with the maximum likelihood.

Since we divided by ΣL(U|θ) in order to apply Bayes' Theorem,
we find that Bayes' Theorem is not necessary for maximum likelihood
estimation.

Another short cut is to take the logarithms of the Pᵢs and Qᵢs
and add them, instead of multiplying the Pᵢs and Qᵢs. The sum of the
logarithms will also always be maximum at the same value of θ. A graph
of the log likelihoods is called the log likelihood function. The log
likelihood function will always be highest at the same θ at which the
likelihood function is highest.
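
A small sketch can confirm that multiplying the Ps and Qs and adding their logarithms point to the same θ. The items here are hypothetical three-parameter logistic items with scaling constant 1.7 (an assumption of this example, as are all parameter values and names).

```python
import math

D = 1.7  # logistic scaling constant (assumed)

# Hypothetical (a, b, c) parameters and a hypothetical response pattern.
ITEMS = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
PATTERN = (1, 1, 0)


def irf(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(theta):
    """Product of P (item right) or Q (item wrong) over the items."""
    value = 1.0
    for u, (a, b, c) in zip(PATTERN, ITEMS):
        p = irf(theta, a, b, c)
        value *= p if u == 1 else (1.0 - p)
    return value


def log_likelihood(theta):
    """Sum of log P (item right) or log Q (item wrong) over the items."""
    total = 0.0
    for u, (a, b, c) in zip(PATTERN, ITEMS):
        p = irf(theta, a, b, c)
        total += math.log(p if u == 1 else (1.0 - p))
    return total


thetas = [k / 10 for k in range(-30, 31)]  # -3.0 to +3.0 in steps of 0.1
best_by_product = max(thetas, key=likelihood)
best_by_log_sum = max(thetas, key=log_likelihood)

print("theta with maximum likelihood:    ", best_by_product)
print("theta with maximum log likelihood:", best_by_log_sum)
```

The logarithm is an increasing function, so it changes the height of the curve but not the location of its peak. With long tests the sum of logarithms is also easier to compute, since a product of many small probabilities becomes very small.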

It should be noted that, in this example, you would be right
in estimating θ = -2.0 only 17.8% of the time and wrong 82.2% of the
time. But this is true only because the test had only three items.
With a longer test there would be one θ at which the likelihood is
much greater than any other.

12.8 Table 12.8 shows the maximum likelihood method of estimating
θ for a test made of the four items whose IRFs are shown in Figure
6.17.

(1) across the top are 17 values of θ

(2) under the θs are the P(θ)s for each of the four items.

(3) the item numbers and parameters are in the top left corner.

(4) down the left side are the 16 possible response patterns for
four items and the raw (# right) score represented by the response
patterns.
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Table 12.8

An illustration of the MLE of θ for all possible
response patterns from a test composed of four real
items. (All likelihoods are multiplied by 1000 to
reduce decimal values.)



(5) in the body of the table are the L(U|θ)s for each
possible U for the 17 values of θ. Each L(U|θ) is
multiplied by 1000 to eliminate decimal values.

(6) underlined in each row is the maximum L(U|θ)

(7) down the right side are the values of θ where the
underlined maximum likelihoods occur. These θs are the
maximum likelihood estimates (MLE) of θ for each of the 16
possible U.

Note that the MLE for U = 0000 is -∞, and the MLE for U = 1111
is +∞. That is a characteristic of the MLE. The MLE will not give a
finite estimate of θ unless the examinee has missed at least one item
and answered at least one item correctly. This limitation is not
serious because raw scores of 0% or 100% are usually rare.
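
The behavior of the MLE for all-right and all-wrong patterns can be illustrated with a small grid search. This sketch does not use the four items of Table 12.8; it assumes three hypothetical three-parameter logistic items with scaling constant 1.7, and every name and value in it is an assumption made for illustration.

```python
import math

D = 1.7  # logistic scaling constant (assumed)

# Hypothetical (a, b, c) item parameters.
ITEMS = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]


def irf(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(pattern, theta):
    """Product of P (item right) or Q (item wrong) over the items."""
    value = 1.0
    for u, (a, b, c) in zip(pattern, ITEMS):
        p = irf(theta, a, b, c)
        value *= p if u == 1 else (1.0 - p)
    return value


def grid_mle(pattern, limit):
    """Theta with the greatest likelihood on a grid from -limit to +limit."""
    steps = int(limit * 2)
    thetas = [k / 2 for k in range(-steps, steps + 1)]
    return max(thetas, key=lambda t: likelihood(pattern, t))


for limit in (3.0, 6.0):
    print(f"grid of theta from {-limit:+.1f} to {limit:+.1f}:")
    for pattern in [(0, 0, 0), (1, 1, 0), (1, 1, 1)]:
        label = "".join(str(u) for u in pattern)
        print(f"  U = {label}   MLE on this grid = {grid_mle(pattern, limit):+.1f}")
```

For the all-right and all-wrong patterns the estimate always lands on the edge of the grid, however wide the grid is made, which is the grid-search symptom of an infinite MLE. The mixed pattern settles at the same interior θ on both grids.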

The MLE of θ > 2.7 is due to the limited range of θ used in this
example. A larger range of θ would yield a more precise MLE of θ.

The many cells with L(U|θ) = 0 in the body of Table 12.8 are due
to the very unusual item #17.

12.9 Now compare in Table 12.8 the raw scores on the left with the
MLEs on the right. You can see that a raw score of 1 represents
θs from -2.3 to +2.0, an extreme range. A raw score of 2 represents
θs from -1.3 to greater than +2.7. A raw score of 3 represents θs
from +1.3 to greater than +2.7.

The extreme range of θ, depending on the Us represented by a
single raw score, demonstrates well the inadequacy of using raw
score as an estimate of ability. The inadequacy of raw score as an
estimate of ability is due to the fact that raw score cannot
distinguish chance success from knowledge success on an item. In
contrast, the MLE takes guessing into account by using the additional
information in the response pattern.
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CHAPTER 13 



Bayesian Modal Estimation of θ

13.1 The Bayesian Modal method of estimating θ takes up where the MLE
stops. The proponents of the Bayesian Modal method (called Bayesians)
reason that if the distribution of θ is known or assumed, then that
knowledge or assumption provides additional information which can be
used to more accurately estimate θ.

13.2 Bayesians assume that θ is distributed normally. The assumption
of normality means that the probability of any randomly-chosen examinee
having a θ at the extremes is less than his probability of having a
θ located near the mean. The assumption of normality is made on an a
priori basis (i.e. before empirical evidence). Thus, it is called the
normal "prior" distribution.

13.3 Suppose the likelihood of θ₁|U is very close to the likelihood of
θ₂|U, but that there are many more examinees at θ₂ than at θ₁. In
this case we would be right more often by estimating θ at θ₂ than at
θ₁. In doing so we would, in effect, be weighting our likelihood by
the number of examinees at the two θ values. If we take this idea to
its logical extreme, we should weight all likelihoods by the proportion
of examinees at each value of θ in order to reduce our errors.

13.4 By assuming a normal distribution of θ we can weight the
likelihood by the relative proportions of area under the normal curve.

To do this we merely multiply the area within the interval of the normal curve
at θ, designated ∫N(0,1), times L(U|θ). Table 13.4 shows how this is done
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Table 13.4

An illustration of the Bayesian Modal Estimate of θ
for all possible response patterns from a test
composed of four real items. (All likelihoods are multiplied
by 10,000 to reduce decimal values.)
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using the likelihoods from Table 12.8.

(1) the top row are points of θ which are midpoints of
intervals of θ.

(2) the 2nd and 3rd rows are the limits of the intervals.

(3) the 4th row is the proportion of area under the normal
curve and within the interval.

(4) in the body of the table each column is the area in the 4th
row multiplied by the corresponding likelihood from Table 12.8
(times 100,000 to remove decimal values), i.e., L(U|θ) x ∫N(0,1).

(5) the largest value in each row is underlined.

(6) the θs for the underlined likelihoods are in the right
column. These are the Bayesian Modal Estimates (BME) of θ.

The BME is called modal because, when we choose the largest value
in each row, we are choosing the mode of the distribution of L(U|θ) x
∫N(0,1).
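
The weighting steps listed above can be sketched in Python. Since Tables 12.8 and 13.4 are not reproduced here, this sketch borrows the three hypothetical items of Table 12.7 instead of the four real items, and it assumes that each θ point is the midpoint of an interval 0.5 wide. Both choices, and all names, are assumptions of the example.

```python
import math

# Theta midpoints and P values for the three items of Table 12.7.
thetas = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
p_item = [
    [.29, .32, .37, .50, .62, .77, .88, .93, .97, .98, .99, .99],  # item #1
    [.36, .39, .45, .60, .77, .90, .97, .99, .99, .99, .99, .99],  # item #2
    [.21, .22, .25, .30, .38, .50, .59, .70, .79, .87, .91, .95],  # item #3
]
HALF_WIDTH = 0.25  # assumed: each interval runs from theta - .25 to theta + .25


def normal_cdf(x):
    """Area under the standard normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prior_area(theta):
    """Proportion of the normal curve lying within the interval around theta."""
    return normal_cdf(theta + HALF_WIDTH) - normal_cdf(theta - HALF_WIDTH)


def likelihood(pattern, k):
    """L(U|theta) at the k-th theta point."""
    value = 1.0
    for u, p_row in zip(pattern, p_item):
        value *= p_row[k] if u == 1 else (1.0 - p_row[k])
    return value


def mle(pattern):
    """Theta with the largest unweighted likelihood."""
    best = max(range(len(thetas)), key=lambda k: likelihood(pattern, k))
    return thetas[best]


def bme(pattern):
    """Theta with the largest likelihood weighted by the normal prior area."""
    best = max(range(len(thetas)),
               key=lambda k: prior_area(thetas[k]) * likelihood(pattern, k))
    return thetas[best]


for pattern in [(0, 0, 0), (0, 1, 0), (1, 1, 1)]:
    label = "".join(str(u) for u in pattern)
    print(f"U = {label}   MLE = {mle(pattern):+.1f}   BME = {bme(pattern):+.1f}")
```

In this sketch the unweighted estimates for the all-wrong and all-right patterns sit at the ends of the grid, while the weighted estimates are pulled toward zero, the mean of the normal prior.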



13.5 Bayesian Modal Estimates are more conservative than MLEs
(conservative means closer to zero, the mean of the normal prior
distribution). Note that with U=0000 and U=1111, the BMEs of θ are
finite. The finiteness of θ estimates of BME when either all or
no items are answered correctly is a minor advantage of BME.

*Note: There are several computational errors in Table 13.4. However,
these errors do not affect the explanation of the concepts involved.
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13.6 There is an active controversy between the Bayesians and the
proponents of the MLE. The Bayesians argue that MLE is the same as
a BME, if θ is assumed to be distributed rectangularly. (A
rectangular distribution of θ means that there are equal numbers of
examinees at all θ values, even at +∞ and -∞.) And so, say the Bayesians,
since a normal distribution of θ is more reasonable to assume than a
rectangular distribution, the BME is a more accurate estimate of θ.

The proponents of MLE argue that the coincidence of the MLE
(which assumes no distribution of θ) being the same as a BME with
rectangular distribution is irrelevant. The important thing is that
MLE makes no assumption about the distribution of θ, whereas BME makes
the additional assumption, which will be sometimes false.*

13.7 I shall not take sides in this matter, because for me the point
is moot. The only computer program available to me at present is
OGIVIA-3 (see Chap. 15), which uses the BME. Therefore, I shall
continue to use BME until I have a program which uses MLE. At that
time I shall have to make a decision.

13.8 Another type of Bayesian estimation is called Owen's Bayesian,
after its inventor, R. J. Owen (1975). The Owen's Bayesian method is
used primarily in tailored testing (see Chap. 17).

*I apologize to both sides of this complex issue for this meager
representation of their positions.
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CHAPTER 14 
Assumptions 



14.1 There are 4 basic assumptions of IRT. The first of these is a
minor assumption. It is an assumption of any test theory, without
which there would be no justification for testing.

Assumption #1: The Know-Correct Assumption: if the examinee
knows the correct answer to the item, he will answer it correctly.*
We have probably all violated this assumption while taking tests by
marking a different choice than we intended to mark. Occasionally,
an examinee will inadvertently skip an item, and then mark all the
rest of his answers in the wrong places. This is merely a clerical
error, but there is no provision for it in any test theory. Another
way to state the first assumption is: if he got the item wrong,
then he did not know the answer.

14.2 Assumption #2: The Normal Ogive Assumption: The IRF takes the
form of the normal ogive. This is the problem, mentioned in Section
3.3, which deterred Lord's work for 10 years. The difficulty lay with
3 parts of the IRF.

a. The lower asymptote 

b. The upper asymptote 

c. The middle or rapidly rising part of the IRF 



*The reader should take careful note that the inverse of this
assumption is NOT made. That is, it is NOT ASSUMED that if the examinee
gets the item correct, he knows the answer. I emphasize this
distinction because many persons upon first reading of assumption #1 misread
it as its inverse.



(1) As previously noted, the c-value of an IRF is often not
1/A. This is the case with observed parts of the lower asymptote.
But what about the unobserved parts? If an item from the SAT with
c = .09 were given to extremely low θ persons such as kindergarten
children or mentally retarded persons, would the lower tail of the
IRF rise to 1/A?

(2) It has been charged by Hoffmann (1962) that tests may
penalize extremely high ability persons, because they know too much.
That is, they consider factors far beyond the intended scope of the
item, and therefore get it wrong. If that were the case, then the IRF
would curve down away from the upper asymptote at high θs. This has
been called the Banesh Hoffmann Effect.

(3) It was not known that the IRF was monotonic, and that its
general shape was that of a normal ogive.

In 1965 Lord published a massive study with a sample size greater
than 100,000. Specifically, he found:

a. the lower tail of the IRF did not rise for almost all items.
The very few items that did rise, did so to a very small
extent.

b. no evidence of the Banesh Hoffmann Effect.

c. good indications that the IRF is strictly monotonic.
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